About This Document 


This document describes language extension specifications that allow software developers to access hardware 
features that are not easily accessible from a high-level language, such as C or C++, in order to obtain the best 
performance from a Synergistic Processor Unit (SPU) and a Power Processing Unit (PPU) of the Cell Broadband 
Engine™ (CBE). This document also includes function specifications to facilitate communication between SPUs and 
the PPU, and it lists a minimal set of standard library functions that must be provided as part of a standard SPU 
programming environment. 


Audience 


This document is intended for system and application programmers who want to write SPU and PPU programs for a 
CBEA-compliant processor. 


Version History 


This section describes significant changes made to each version of this document. 





Version Number & Date Changes 








v. 2.3 
December 4, 2006 


Corrected the function parameter ordering of the PPU __stwbrx intrinsic 
(TWG_RFC00074-0: CORRECTION NOTICE) 


Corrected the type of element initializers used to initialize a vector of 
signed/unsigned char (TWG_RFC00075-0: CORRECTION NOTICE) 


Changed to note that the use of double-precision contracted operations is 
permitted by default unless prohibited by the FP_CONTRACT pragma or the 
no-fast-double compiler option (TWG_RFC00076-0). 


Added PPU data types and programming directives to Chapter 1, and 
changed title from “SPU Data Types and Program Directives” to “Data Types 
and Programming Directives” (TWG_RFC00077-1). 


Removed the __fre, __frsqrtes, and __popcntb intrinsics, and added 
the __frsqrte intrinsic (TWG_RFC00078-3). 


Added that support is provided in the floating-point environment for both 
double-precision elements and all four single-precision elements. Also, 
updated information for FLT_ROUNDS (TWG_RFC00079-1). 

Added a new chapter, “PPU VMX Intrinsics”, that specifies a set of intrinsic 
functions making the underlying PPU VMX instruction set accessible from the 
C programming language (TWG_RFC00081-1 and TWG_RFC00092-0). 
Added 32-bit ABI support to the PPU intrinsic functions, changed function 
arguments to provide a consistent high-level interface, and corrected several 
typographical errors (TWG_RFC00083-1). 

Changed the return type of the __fctiw and __fctiwz PPU intrinsic 
functions, changed the descriptive names of these and other similar 
conversion intrinsics, and removed the __stfiwx intrinsic function 
(TWG_RFC00089-1). 

Identified deprecated PPU VMX operations and recommendations for suitable 
PPU intrinsic function alternatives (TWG_RFC00090-0). 

Identified non-supported language features and specified that C++ exception 
handling is not supported on the SPU (TWG_RFC00091-0). 


Applied corrections: TWG_RFC00086-0 and TWG_RFC00087-0. 








v. 2.2 
October 11, 2006 


v. 2.1 
October 20, 2005 


v. 2.0 
July 11, 2005 


Applied the changes made in the following requests: TWG_RFC00056-0, 
TWG_RFC00057-0, TWG_RFC00058-2, TWG_RFC00061-1, 
TWG_RFC00060-1, TWG_RFC00062-0, TWG_RFC00066-2, 
TWG_RFC00067-2, TWG_RFC00068-0, TWG_RFC00070-1, 
TWG_RFC00072-0, and TWG_RFC00073-0. 


Changed document title because its contents are no longer limited to the 
SPU. Changed the sections “About this Document” and “Audience” 
accordingly. Applied TWG_RFC00053-0, TWG_RFC00054-1, and 
TWG_RFC00055-0. 


Replaced uses of a protected name by references to the document Altivec 
Technology Programming Interface Manual per TWG_RFC00050-1 and 
TWG_RFC00052-0. 


Corrected several operand errors related to spu_sub, which is the arithmetic 
intrinsic for vector subtraction (TWG_RFC00046-0: CORRECTION NOTICE). 


Corrected various documentation errors; for example, changed sample code 
demonstrating how to restore the Stack Pointer Information register as a 
result of invoking the longjmp function (TWG_RFC00047-0: CORRECTION 
NOTICE). 


Specified that alternate vector syntax for vector literals is optional rather than 
mandatory (TWG_RFC00050). 


Added a sub-section called “Malloc Heap” to the C library section of the “C 
and C++ Standard Libraries” chapter. This section is related to an attempt to 
define a standard process for memory heap initialization and stack 
management (TWG_RFC00024-3). 


In the “SPU and Vector Multimedia Extension Intrinsics” chapter, clarified 
which intrinsic mappings are required according to this specification and 
which are not because a straightforward mapping does not exist. Provided 
additional explanations regarding the intrinsics that are difficult to map 
(TWG_RFC00034-1: CORRECTION NOTICE). 


Corrected the description of the si_stqx instruction (TWG_RFC00035-0: 
CORRECTION NOTICE). 


Corrected various documentation errors; for example, changed several 
descriptions in the “Alternate Vector Literal Format and Description” table 
(TWG_RFC00036-0: CORRECTION NOTICE, TWG_RFC00041-0: 
CORRECTION NOTICE, TWG_RFC00045-0: CORRECTION NOTICE). 


Changed “Broadband Processor Architecture” to “Cell Broadband Engine™ 
Architecture”, and changed “BPA” to “CBEA” (TWG_RFC00037-0: 
CORRECTION NOTICE). 


Deleted several references to BE revisions DD1.0 and DD2.0 
(TWG_RFC00040-0: CORRECTION NOTICE). 


Added a new chapter describing MFC I/O intrinsics; these intrinsics facilitate 
MFC programming by defining a common set of utility functions 
(TWG_RFC00043-2). 


Deleted several sections in the “About This Document” chapter. Changed two 
entries in the Write Word Channel table from si_wrch(channel, 
si_to_int(a)) to si_wrch(channel, si_from_int(a)). Clarified 
that the syntax for vector type specifiers does not allow the use of a typedef 
name as a type specifier. (All changes per TWG_RFC00032-0: 
CORRECTION NOTICE.) 








v. 1.9 
June 10, 2005 


v. 1.8 
May 12, 2005 


v. 1.7 
July 16, 2004 


v. 1.6 
March 12, 2004 


v. 1.5 
February 25, 2004 


v. 1.4 
January 20, 2004 


Added new chapter describing C and C++ Libraries (TWG_RFC00018-5). 


Added new chapter describing SPU floating-point arithmetic 
(TWG_RFC00027-1). 

Changed “Broadband Engine” or “BE” to “a processor compliant with the 
Broadband Processor Architecture” or “a processor compliant with BPA”; 
changed VMX to Vector Multimedia Extension; changed Synergistic 
Processing Element to Synergistic Processor Element; and changed 
Synergistic Processing Unit to Synergistic Processor Unit. Defined a PPU as 
a PowerPC Processor Unit on first major instance. Corrected several book 
references and changed copyright page so that trademark owners were 
specified. (All changes per TWG_RFC00031-0: CORRECTION NOTICE.) 
Made miscellaneous changes to the “About This Document” section. 


Added new channel number for multisource synchronization requests 
(TWG_RFC00023-1). 


Corrected example describing loading of misaligned vectors. 


Changed PU to PPU and SPC to SPE; changed “PU-to-SPU” (mailboxes) 
and “SPU-to-PU” to “inbound” and “outbound” respectively 
(TWG_RFC00028-1: CORRECTION NOTICE). 


Changed the name of spu_mulhh to spu_mule (TWG_RFC00021-0). 


Updated channel names to coincide with BPA channel names 
(TWG_RFC00029-1). 


Clarified that channel intrinsics must not be reordered with respect to other 
channel commands or volatile local-storage memory accesses 
(TWG_RFC00007-1). 


Warned that compliant compilers may ignore __align_hint intrinsics 
(TWG_RFC00008-1). 


Added an additional SPU instruction, orx (TWG_RFC00010-0). 


Added mnemonics for channels that support reading the event mask and tag 
mask (TWG_RFC00011-0). 


Specified that spu_ienable and spu_idisable intrinsics do not have 
return values (TWG_RFC00013-0). 


Moved paragraph beginning “This intrinsic is considered volatile...” from 
spu_mfspr intrinsic to spu_mtfpscr (TWG_RFC00014-0). 

Changed the descriptions for si_lqd and si_stqd intrinsics 
(TWG_RFC00015-1). 


Provided new descriptions of various rotation-and-mask intrinsics, 
specifically: spu_rlmask, spu_rlmaska, spu_rlmaskqw, 
spu_rlmaskqwbyte, and spu_rlmaskqwbytebc. These descriptions 
include pseudo-code examples (TWG_RFC00016-1). 


Made miscellaneous editorial changes. 
Made miscellaneous editorial changes. 


Changed formatting of document so that it reflects the typographic 
conventions described on page xviii. Made miscellaneous editorial changes. 


Changed some of the parameter types for spu_mfcdma32 and 
spu_mfcdma64, as requested in TWG_RFC00002. 


Inserted new specifications for the vector literal format, as requested in 
TWG_RFC00003. 


Changed document to new format, including front matter. Made 
miscellaneous editorial changes. 
v. 1.3 
November 4, 2003 
v. 1.2 
September 2, 2003 


v. 1.1 
June 15, 2003 


v. 1.0 
April 28, 2003 


v. 0.9 
March 7, 2003 


v. 0.8 
January 23, 2003 


v. 0.7 
November 18, 2002 


v. 0.6 
September 24, 2002 


v. 0.5 
August 27, 2002 


Added enable/disable interrupt intrinsics. 


Changed parameter types of spu_sel intrinsic to be compatible with Vector 
Multimedia Extension’s vec_sel. 


Added si_stopd specific intrinsic. 
Corrected tables for spu_genb and spu_genc generic intrinsics. 
Made changes to support RFC 24. Added isolation control channel 64. 


Made changes to support RFC 33. Removed spu_addc, spu_addsc, 
spu_subb, and spu_subsb. Added spu_addx, spu_subx, spu_genc, 
spu_gencx, spu_genb, and spu_genbx. 


Made minor corrections. 


Added new intrinsics to support new or modified instructions. These include: 
fscrrd, fscrwr, stop, dfma, mpyhhau, mpyhhu, rotqmbybi, iret, lqr, 
and stqr. Also added intrinsics to support new feature bits for iret, 
bisled, bihnz, and sync. 


Improved documentation of specific intrinsics. Completely defined parameter 
ordering and immediate sizes. 


Defined new global (spu_intrinsics.h) and compiler specific 
(spu_internals.h) header files. Specified that single token vector types 
and channel enumerants are declared in spu_intrinsics.h. 


Added specific pointer casting intrinsics. 
Added standardized __SPU__ conditional compilation control. 


Changed specific convert intrinsics to unbiased scale parameters, such as 
generic intrinsics. 


Specified that the bisled target function does not observe the standard calling 
convention with respect to volatile registers. 


Specified that gcc-style inline assembly is required. 
Specified that __builtin_expect is required. 
Added bisled specific and generic intrinsics. 

Added __align_hint intrinsic. 

Specified that the restrict type qualifier is required. 


Specified that out-of-range scale factors on generic conversion intrinsics 
return an error. 


Changed document title to include C++. 
Made miscellaneous clarifications and typing corrections. 
Changed spu_eqv to return the same vector type as its inputs. 


Changed spu_and, spu_or, and spu_xor to accept immediate values of 
the same type as the elements of parameter a. 


Added specific casting intrinsics. 


Changed default action on out-of-range immediate values for specific 
intrinsics to issuing an error. 


Added documentation of the __builtin_expect builtin. 

Completed SPU-to-Vector Multimedia Extension intrinsic mapping section. 
Edited discussion of Vector Multimedia Extension-to-SPU intrinsic mapping. 
Removed appendices. 


Added support for 32-bit read and write channel intrinsics. Renamed 
quadword channel read and write to readchqw and writechqw. 
July 16, 2002 
July 9, 2002 
v. 0.1 


June 21, 2002 


Corrected the instruction mapping for spu_promote and spu_extract. 


Specified that instruction mapping for generic intrinsics spu_re and 
spu_rsqrte include the FI (floating-point interpolate) instruction. 


Renamed spu_splat to spu_splats (scalar splat) to avoid confusion with 
vec_splat. 


Added documentation about the size of the immediate intrinsic forms. 
Changed all vector signed long to vector signed long long. 


Changed count to unsigned for spu_sl, spu_slqw, spu_slqwbyte, and 
spu_slqwbytebc. 


Changed count to signed for spu_rl, spu_rlmask and spu_rlmaska. 
Specified that the return value of spu_cntlz is an unsigned value. 
Corrected description of spu_gather intrinsic. 


Edited mapping documentation of scalars for spu_and, spu_or, and 
spu_xor. 


Removed vector input forms of spu_hcmpeq and spu_hcmpgt. 
portability. 
Changed vector long types to vector long long. 


Document Structure 
This document contains the following major sections: 


1. Data Types and Programming Directives 
2. SPU Low-Level Specific and Generic Intrinsics 
3. Composite Intrinsics 
4. Programming Support for MFC Input and Output 
5. SPU and Vector Multimedia Extension Intrinsics 
6. PPU VMX Intrinsics 
7. PPU Intrinsics 
8. SPU C and C++ Standard Libraries and Language Support 
9. Floating-Point Arithmetic on the SPU 


Bit Notation 


Standard bit notation is used throughout this document. Bits and bytes are numbered in ascending order from left to 
right. Thus, for a 4-byte word, bit 0 is the most significant bit and bit 31 is the least significant bit, as shown in the 
following figure: 


[Figure: a 32-bit word with bit positions numbered 0 through 31 from left to right; bit 0 is labeled MSB and bit 31 is labeled LSB.] 


MSB = Most significant bit 


LSB = Least significant bit 


Notation for bit encoding is as follows: 


• Hexadecimal values are preceded by 0x. For example: 0x0A00. 
• Binary values in sentences appear in single quotation marks. For example: ‘1010’. 
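
Because bit 0 denotes the most significant bit in this notation, a bit number taken from this document cannot be used directly as a shift count in C. The following is a minimal sketch of the conversion for a 32-bit word; the helper names be_bit_mask and be_bit_get are hypothetical and are not part of this specification.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask for bit `n` of a 32-bit word under the
   document's numbering, where bit 0 is the MSB and bit 31 is the LSB. */
static uint32_t be_bit_mask(unsigned n)
{
    assert(n < 32);
    return UINT32_C(1) << (31u - n);
}

/* Hypothetical helper: value (0 or 1) of bit `n` under the same numbering. */
static unsigned be_bit_get(uint32_t word, unsigned n)
{
    return (word & be_bit_mask(n)) != 0;
}
```

For the example value 0x0A00, the two bits that are set are bits 20 and 22 in this notation.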


Byte Ordering and Element Numbering 


As shown in Figure 1-1, byte ordering and element numbering are always displayed in big-endian order. 


Figure 1-1: Big-Endian Byte/Element Ordering for Vector Types 


[Figure: a 128-bit quadword shown from left to right as Byte 0 (MSB) through Byte 15, and subdivided as doublewords 0 to 1, words 0 to 3, halfwords 0 to 7, and chars 0 to 15. Element 0 is always the leftmost (most significant) element.] 
Typographic Conventions 



In addition to bit notation, the following typographic conventions are used throughout this document: 





Convention                  Meaning 
courier                     Indicates programming code, processing instructions, register names, 
                            data types, events, file names, and other literals. Also indicates function 
                            and macro names. This convention is only used where it facilitates 
                            comprehension, especially in narrative descriptions. 
courier + italics           Indicates arguments, parameters and variables, including variables of 
                            type const. This convention is only used where it facilitates 
                            comprehension, especially in narrative descriptions. 
italics (without courier)   Indicates emphasis. Except when hyperlinked, book references are in 
                            italics. When a term is first defined, it is often in italics. 
blue                        Indicates a hyperlink (color printers or online only). 
1. Data Types and Programming Directives 


This chapter specifies PPU Vector Multimedia eXtension™ (VMX) and SPU vector data types, operations on these 
data types, programming directives, and predefined macro target definitions. 


Any conflict between the requirements described here for PPU Vector Multimedia eXtension (VMX) data types and 
the Altivec Technology Programming Interface Manual is unintentional. 


1.1. Data Types 


A set of fundamental vector data types is introduced to the C language. These data types are shown in Table 1-1 
along with whether the type is supported on the PPU, SPU, or both. All of these data types are 128 bits long and 
contain from 2 to 16 elements, depending on the corresponding element data type. 


Table 1-1: Vector Data Types 











Vector Data Type             Content                                          SPU/PPU 
vector unsigned char         16 8-bit unsigned chars                          Both 
vector signed char           16 8-bit signed chars                            Both 
vector unsigned short        8 16-bit unsigned halfwords                      Both 
vector signed short          8 16-bit signed halfwords                        Both 
vector unsigned int          4 32-bit unsigned words                          Both 
vector signed int            4 32-bit signed words                            Both 
vector unsigned long long    2 64-bit unsigned doublewords                    SPU 
vector signed long long      2 64-bit signed doublewords                      SPU 
vector float                 4 32-bit single-precision floats                 Both 
vector double                2 64-bit double-precision floats                 SPU 
qword                        quadword (16-byte), used exclusively as an       SPU 
                             input/output to a specific intrinsic function. 
                             See section “2.1. Specific Intrinsics” 
vector bool char             16 8-bit bools: 0 (false), 255 (true)            PPU 
vector bool short            8 16-bit bools: 0 (false), 65535 (true)          PPU 
vector bool int              4 32-bit bools: 0 (false), 2^32 - 1 (true)       PPU 
vector pixel                 8 16-bit unsigned halfwords, 1/5/5/5 pixel       PPU 





The syntax for vector type specifiers does not allow the use of a typedef name as a type specifier. For example, the 
following declaration is not allowed: 


typedef signed short int16; 
vector int16 data; 


1.1.1. Mapping of PPU Data Types To SPU Data Types 


Not all PPU vector data types are supported on the SPU. The PPU vector data types that do not map identically to 
SPU data types are shown in Table 1-2. 


Table 1-2: Non-identical Mapping of VMX Data Types To SPU Data Types 











VMX Data Type          Maps to SPU Data Type 
vector bool char       vector unsigned char 
vector bool short      vector unsigned short 
vector bool int        vector unsigned int 
vector pixel           vector unsigned short (1) 

(1) Because vector pixel and vector bool short are mapped to the same base vector type (vector 
unsigned short), the overloaded functions for vec_unpackh and vec_unpackl cannot be uniquely resolved. 

1.1.2. Mapping of SPU Data Types To PPU Data Types 
Not all SPU data types are supported by the PPU VMX. The SPU data types that do not map identically to PPU 
vector data types are shown in Table 1-3. 


Table 1-3: Non-identical Mapping of SPU Data Types To VMX Data Types 

SPU Data Type                Maps to VMX Data Type 
vector unsigned long long    vector bool char 
vector signed long long      vector bool short 
vector double                vector bool int 





1.2. Header Files 


There are separate system header files for the SPU and PPU that include typedefs and other information required 
by the language extension features defined in this specification. 


The SPU system header file, spu_intrinsics.h, defines common enumerations and typedefs. These include the 
single token vector types and MFC channel mnemonic enumerations (see Table 1-4 on page 2 and Table 2-89 on 
page 48, respectively). In addition, spu_intrinsics.h includes a compiler-specific header file, 
spu_internals.h, that contains any implementation-specific definitions. 


The PPU system header file, altivec.h, defines typedefs and keywords and also includes any implementation 
specific definitions. The PPU system header file, vec_types.h, defines typedefs required by the language 
extension features defined in this specification. 


1.2.1. Single Token Typedefs 


To improve code portability, single token typedefs are provided for the vector keyword data types. These typedefs, 
which are shown in Table 1-4, are defined in spu_intrinsics.h on the SPU and in vec_types.h on the PPU. 
Besides simplifying type declarations, the single token types serve as class names for extending generic intrinsics or 
for mapping between PPU VMX intrinsics and/or SPU intrinsics. 


Table 1-4: Single Token Vector Data Types 











Vector Keyword Data Type Single Token Typedef SPU/PPU 
vector unsigned char vec_uchar16 Both 
vector signed char vec_char16 Both 
vector unsigned short vec_ushort8 Both 
vector signed short vec_short8 Both 
vector unsigned int vec_uint4 Both 
vector signed int vec_int4 Both 
vector unsigned long long vec_ullong2 SPU 
vector signed long long vec_llong2 SPU 
vector float vec_float4 Both 
vector double vec_double2 SPU 
vector bool char vec_bchar16 PPU 
vector bool short vec_bshort8 PPU 
vector bool int vec_bint4 PPU 
vector pixel vec_pixel8 PPU 





1.3. Alignment 


Table 1-5 shows the size and default alignment of the various data types. 


Table 1-5: Default Data Type Alignments 











Data Type    Size (bytes)    Alignment 
char         1               byte 
short        2               halfword 
int          4               word 
long         4               word/doubleword 
long long    8               doubleword 
float        4               word 
double       8               doubleword 
pointer      4               word 
vector       16              quadword 





Additional alignment controls can be achieved on a variable or on a structure/union member using the GCC aligned 
attribute. For example, in the following declaration statement, the floating-point scalar factor is aligned on a 
quadword boundary: 


float factor __attribute__ ((aligned (16))); 
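

As an illustration only, the effect of the aligned attribute can be checked at run time by testing the low-order address bits. The sketch below assumes a GCC-compatible compiler; the helper name is_aligned is hypothetical and is not part of this specification.

```c
#include <assert.h>
#include <stdint.h>

/* GCC-style attribute, as in the declaration above: request quadword
   (16-byte) alignment for a scalar that would otherwise be word aligned. */
static float factor __attribute__((aligned(16))) = 2.0f;

/* Hypothetical helper: returns 1 when `p` lies on a `boundary`-byte
   boundary. `boundary` must be a nonzero power of 2. */
static int is_aligned(const void *p, uintptr_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return ((uintptr_t)p & (boundary - 1)) == 0;
}
```

With this declaration, is_aligned(&factor, 16) is expected to hold, which is what allows the scalar to be loaded as part of a single quadword.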


1.3.1. __align_hint (SPU only) 
The __align_hint intrinsic is provided to: 
• Improve data access through pointers 
• Provide compilers the additional information that is needed to support auto-vectorization 
This intrinsic is available only for the SPU. Although it is also useful for the PPU, supporting it is not required. 


Although __align_hint is defined as an intrinsic, it behaves like a directive, because no code is ever specifically 
generated. For example: 


__align_hint(ptr, base, offset) 


The __align_hint intrinsic informs the compiler that the pointer ptr points to data with a base alignment of base 
and with an offset from base of offset. The base alignment has to be a power of 2. A base address of zero 
implies that the pointer has no known alignment. The alignment offset has to be less than base or zero. 


The __align_hint intrinsic is not intended to specify pointers that are not naturally aligned. Specifying pointers 
that are not naturally aligned results in data objects straddling quadword boundaries. If a programmer specifies 
alignment incorrectly, incorrect programs might result. 


Programming Note: Although compliant compiler implementations have to provide the __align_hint intrinsic, 
compilers may ignore these hints. 
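

Because an incorrect hint can produce an incorrect program, it may be useful during development to verify the condition that a hint asserts. The following portable sketch expresses that condition as read from the description above; the helper satisfies_align_hint is hypothetical, is not part of this specification, and does not call the intrinsic itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when `ptr` is consistent with
   __align_hint(ptr, base, offset) as described in this section. */
static int satisfies_align_hint(const void *ptr, unsigned base, unsigned offset)
{
    if (base == 0)
        return 1;                        /* no known alignment: nothing to check */
    assert((base & (base - 1)) == 0);    /* base has to be a power of 2 */
    assert(offset < base);               /* offset has to be less than base */
    return ((uintptr_t)ptr & (base - 1u)) == offset;
}
```

A debug build could assert satisfies_align_hint(ptr, 16, 0) immediately before a hint of __align_hint(ptr, 16, 0).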


1.4. Operating on Vector Types 


Most of the C/C++ operators and basic operations have not been extended to operate on vector data types; 
however, a few have been extended. The operators and operations that have been extended are: the sizeof() 
operator, the assignment operator (=), the address operator (&), pointer operations, and type casting operations. 


1.4.1. sizeof() Operator 


The operation sizeof() on a vector type always returns 16. 


1.4.2. Assignment Operator 


If either the left or right side of an expression has a vector type, both sides of the expression have to be of the same 
vector type. Thus, the expression a = b is valid and represents assignment if a and b are of the same type or if 
neither variable is a vector type. Otherwise, the expression is invalid, and the compiler reports the inconsistency as 
an error. 


1.4.3. Address Operator 


The operation &a is valid when a is a vector type. The result of the operation is a pointer to vector a. 


1.4.4. Pointer Arithmetic and Pointer Dereferencing 


The usual pointer arithmetic involving a pointer to a vector type can be performed. For example, assuming p is a 
pointer to a vector type, p+1 is the pointer to the next vector following p. 


Dereferencing the vector pointer p implies a 128-bit vector load from or store to the address obtained by masking 
the 4 least significant bits of p. When a vector is misaligned, the 4 least significant bits of its address are nonzero. 
Although vectors are 16-byte aligned (see section “1.3. Alignment”), it nevertheless might be desirable to load or 
store a vector that is misaligned. A misaligned vector can be loaded in several ways using generic intrinsics (see 
section “2.2. Generic Intrinsics and Built-ins”). 


The following code shows one example of how to load a misaligned floating-point vector on the SPU: 


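vector float load_misaligned_vector_float(vector float *ptr) 
{ 
    vector float qw0, qw1; 
    int shift; 

    /* Each dereference loads the aligned quadword containing the address. */ 
    qw0 = *ptr; 
    qw1 = *(ptr+1); 
    shift = (unsigned) ptr & 15; 

    /* Merge the tail of qw0 with the head of qw1. */ 
    return spu_or( 
        spu_slqwbyte(qw0, shift), 
        spu_rlmaskqwbyte(qw1, shift-16)); 
} 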

Similarly, this next example shows how to store to a misaligned floating-point vector on the SPU. 


void store_misaligned_vector_float(vector float flt, vector float *ptr) 
{ 
    vector float qw0, qw1; 
    vector unsigned int mask; 
    int shift; 

    qw0 = *ptr; 
    qw1 = *(ptr+1); 
    shift = (unsigned) (ptr) & 15; 
    mask = (vector unsigned int) 
        spu_rlmaskqwbyte((vector unsigned char)(0xFF), -shift); 
    flt = spu_rlqwbyte(flt, -shift); 

    *ptr = spu_sel(qw0, flt, mask); 
    *(ptr+1) = spu_sel(flt, qw1, mask); 
} 
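

The byte movement performed by the misaligned load can also be modeled in portable C, which may help when reasoning about the shift counts. The sketch below is an illustration under stated assumptions, not SPU code: it reads only from the two aligned 16-byte blocks that surround the address and merges them, and the function name load_misaligned_16 is hypothetical. The caller is assumed to guarantee that both aligned blocks are readable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable model of a misaligned 16-byte load. On the SPU
   the merge is done with spu_slqwbyte, spu_rlmaskqwbyte, and spu_or. */
static void load_misaligned_16(const uint8_t *ptr, uint8_t out[16])
{
    uintptr_t addr = (uintptr_t)ptr;
    unsigned shift = (unsigned)(addr & 15u);      /* byte offset within the quadword */
    const uint8_t *qw0 = (const uint8_t *)(addr & ~(uintptr_t)15); /* aligned block holding ptr */
    const uint8_t *qw1 = qw0 + 16;                /* next aligned block */

    assert(ptr != NULL);
    memcpy(out, qw0 + shift, 16u - shift);        /* tail of qw0, moved left by `shift` bytes */
    memcpy(out + (16u - shift), qw1, shift);      /* head of qw1 fills the vacated bytes */
}
```

When shift is zero, the result is simply the first aligned block, which matches the aligned case of the SPU code.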
1.4.5. Type Casting 


Pointers to vector types and non-vector types may be cast back and forth to each other. For the purpose of aliasing, 
a vector type is treated as an array of its corresponding element type, as shown in Table 1-6. If a pointer is cast to 
the address of a vector type, it is the programmer's responsibility to ensure that the address is 16-byte aligned. 
Vector types that are applicable only on the PPU do not have an underlying scalar type. 


Table 1-6: Vector Pointer Types and Matching Base Element Pointer Types 











Vector Pointer Type (vector T*) Base Element Pointer Type (T*) SPU/PPU 
vector unsigned char* unsigned char* Both 
vector signed char* signed char* Both 
vector unsigned short* unsigned short* Both 
vector signed short* signed short* Both 
vector unsigned int* unsigned int* Both 
vector signed int* signed int* Both 
vector unsigned long long* unsigned long long* SPU 
vector signed long long* signed long long* SPU 
vector float* float* Both 
vector double* double* SPU 





Casts from one vector type to another vector type have to be explicit and are done using normal C-language casts. 
None of these casts performs any data conversion. Thus, the bit pattern of the result is the same as the bit pattern 
of the argument that is cast. 
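

The distinction between reinterpreting bits and converting values can be illustrated in portable C. The sketch below models a cast such as (vector unsigned int) applied to a vector float by copying the 16 bytes unchanged; the function name cast_float4_to_uint4 is hypothetical, and the element values mentioned afterward assume IEEE 754 single-precision floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a vector-to-vector cast: the 16 bytes are
   reinterpreted, not converted. memcpy expresses that in portable C. */
static void cast_float4_to_uint4(const float in[4], uint32_t out[4])
{
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(out, in, 4 * sizeof(uint32_t));
}
```

Under that assumption, an element holding 1.0f yields 0x3F800000 rather than the integer 1, which is what a value conversion would produce.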


Casts between vector types and scalar types are illegal. On the SPU, the spu_extract, spu_insert, and 
spu_promote generic intrinsics or the specific casting intrinsics may be used to efficiently achieve the same results 
(see section “2.1.1. Specific Casting Intrinsics”). On the PPU, the vec_lde and vec_ste intrinsics may be used 
to copy between scalar and vector types. 





1.4.6. Vector Literals 


As shown in Table 1-7, a vector literal is written as a parenthesized vector type followed by a curly braced set of 
constant expressions. If a vector literal is used as an argument to a macro, the literal has to be enclosed in 
parentheses. In all other cases, the literal can be used without enclosing parentheses. The elements of the vector 
are initialized to the corresponding expression. Elements for which no expressions are specified default to 0. Vector 
literals may be used either in initialization statements or as constants in executable statements. The syntax for 
vector initialization and for vector compound literals is the same as the corresponding array syntax, except for 
designators, which do not exist for vector elements. The initializer should act as an array of either 2, 4, 8, or 16 
elements depending on the size of the underlying type. For example, the following two initializations are valid and 
equivalent: 


vector signed int v1[] = {{0, 1, 2, 3}, {4, 5, 6, 7}}; 
vector signed int v2[] = {0, 1, 2, 3, 4, 5, 6, 7}; 

The following two struct initializers are also valid and equivalent: 


struct stypy { 
int i; 
vector signed int t; 
} v3 = {1, {0, 1, 2, 3}}, v4 = {1, 0, 1, 2, 3}; 
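

The rule that unspecified elements default to 0 can be tried on a host compiler. The sketch below is an approximation only: it uses the GCC/Clang generic vector extension as a stand-in for vector signed int, and the names v4si and partial_literal_element are hypothetical and are not part of this specification.

```c
#include <assert.h>

/* Stand-in for `vector signed int`, written with the GCC/Clang generic
   vector extension so that the example compiles off-target. */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical helper: returns element `i` of a partially initialized vector. */
static int partial_literal_element(unsigned i)
{
    v4si v = {7, 9};    /* elements 2 and 3 are not specified */
    assert(i < 4);
    return v[i];
}
```

Elements 0 and 1 hold the listed values, and elements 2 and 3 read back as 0. The stand-in type also has a size of 16 bytes, as section 1.4.1 requires of a vector type.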


The following types on both the SPU and PPU cannot be initialized using a vector literal: qword, vector bool 
char, vector bool short, vector bool int, and vector pixel. They can be created by using the intrinsics 
or by casting to these vector types. 


Table 1-7: Vector Literal Format and Description 











Notation Represents SPU/PPU 
(vector unsigned char) {unsigned char, ...} A set of 16 unsigned 8-bit quantities. Both 
(vector signed char) {signed char, ...} A set of 16 signed 8-bit quantities. Both 
(vector unsigned short) {unsigned short, ...} A set of 8 unsigned 16-bit quantities. Both 
(vector signed short) {signed short, ...} A set of 8 signed 16-bit quantities. Both 
(vector unsigned int) {unsigned int, ...} A set of 4 unsigned 32-bit quantities. Both 
(vector signed int) {signed int, ...} A set of 4 signed 32-bit quantities. Both 
(vector unsigned long long) {unsigned long long, ...} A set of 2 unsigned 64-bit quantities. SPU 
(vector signed long long) {signed long long, ...} A set of 2 signed 64-bit quantities. SPU 
(vector float) {float, ...} A set of 4 32-bit floating-point quantities. Both 
(vector double) {double, ...} A set of 2 64-bit floating-point quantities. SPU 





An alternate format may also be supported which corresponds to the syntax specified in the Altivec Technology 
Programming Interface Manual. This format consists of a parenthesized vector type followed by a parenthesized set 
of constant expressions. See Table 1-8. 


Table 1-8: Alternate Vector Literal Format and Description 











Notation Represents SPU/PPU 
(vector unsigned char)(unsigned int) A set of 16 unsigned 8-bit quantities that all have the value specified by the integer. Both 
(vector unsigned char)(unsigned int, ..., unsigned int) A set of 16 unsigned 8-bit quantities specified by the 16 integers. Both 
(vector signed char)(signed int) A set of 16 signed 8-bit quantities that all have the value specified by the integer. Both 
(vector signed char)(signed int, ..., signed int) A set of 16 signed 8-bit quantities specified by the 16 integers. Both 
(vector unsigned short)(unsigned int) A set of 8 unsigned 16-bit quantities that all have the value specified by the integer. Both 
(vector unsigned short)(unsigned int, ..., unsigned int) A set of 8 unsigned 16-bit quantities specified by the 8 integers. Both 
(vector signed short)(signed int) A set of 8 signed 16-bit quantities that all have the value specified by the integer. Both 
(vector signed short)(signed int, ..., signed int) A set of 8 signed 16-bit quantities specified by the 8 integers. Both 
(vector unsigned int)(unsigned int) A set of 4 unsigned 32-bit quantities that all have the value specified by the integer. Both 
(vector unsigned int)(unsigned int, ..., unsigned int) A set of 4 unsigned 32-bit quantities specified by the 4 integers. Both 
(vector signed int)(signed int) A set of 4 signed 32-bit quantities that all have the value specified by the integer. Both 
(vector signed int)(signed int, ..., signed int) A set of 4 signed 32-bit quantities specified by the 4 integers. Both 
(vector unsigned long long)(unsigned long long) A set of 2 unsigned 64-bit quantities that all have the value specified by the long integer. SPU 
(vector unsigned long long)(unsigned long long, unsigned long long) A set of 2 unsigned 64-bit quantities specified by the 2 long integers. SPU 
(vector signed long long)(signed long long) A set of 2 signed 64-bit quantities that all have the value specified by the long integer. SPU 
(vector signed long long)(signed long long, signed long long) A set of 2 signed 64-bit quantities specified by the 2 long integers. SPU 
(vector float)(float) A set of 4 32-bit floating-point quantities that all have the value specified by the float. Both 
(vector float)(float, float, float, float) A set of 4 32-bit floating-point quantities specified by the 4 floats. Both 
(vector double)(double) A set of 2 64-bit double-precision quantities that all have the value specified by the double. SPU 
(vector double)(double, double) A set of 2 64-bit quantities specified by the 2 doubles. SPU 





Restrict Type Qualifier 


The restrict type qualifier, which is specified in the C99 language specification, is intended to help the compiler 
generate better code by ensuring that all access to a given object is obtained through a particular pointer. When a 
pointer uses the restrict type qualifier, the pointer is restrict-qualified. For example:


void *memcpy(void * restrict s1, const void * restrict s2, size_t n);


In the above prototype, both pointers, s1 and s2, are restrict-qualified. Therefore, the compiler can safely 
assume that the source and destination objects will not overlap, allowing for a more efficient implementation. 
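
As a minimal sketch of the same idea applied to user code, the following C99 function declares both of its pointers restrict-qualified. The function name scale_copy and its parameters are hypothetical and are not part of this specification; the caller is responsible for passing buffers that do not overlap.

```c
#include <stddef.h>

/* Hypothetical helper: dst[i] = k * src[i] for n floats.
 * The restrict qualifiers promise the compiler that dst and src never
 * overlap, so values loaded from src stay valid after stores to dst. */
void scale_copy(float * restrict dst, const float * restrict src,
                size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = k * src[i];
    }
}
```

Calling scale_copy with overlapping buffers would break the promise made by the qualifier and results in undefined behavior.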


SPU Programmer Directed Branch Prediction 


Branch prediction can be significantly improved by using feedback-directed optimization. However,
feedback-directed optimization is not always practical in situations where typical data sets do not exist. Instead,
on the SPU, programmer-directed branch prediction is provided using an enhanced version of GCC's
__builtin_expect function.


int __builtin_expect(int exp, int value)


Programmers can use __builtin_expect to provide the compiler with branch prediction information. The return
value of __builtin_expect is the value of the exp argument, which must be an integral expression. For
dynamic prediction, the value argument can be either a compile-time constant or a variable. The
__builtin_expect function assumes that exp equals value.





Static Prediction Example

if (__builtin_expect(x, 0)) {
  foo(); /* programmer doesn't expect foo to be called */
}


Dynamic Prediction Example

cond2 = ... /* predict a value for cond1 */
cond1 = ...
if (__builtin_expect(cond1, cond2)) {
  foo();
}
cond2 = cond1; /* predict that next branch is the same as the previous */


Compilers may require limiting the complexity of the expression argument because multiple branches could be 
generated. When this situation occurs, the compiler has to issue a warning if the program’s branch expectations are 
ignored. 


Programming Note: Implementation of this extension is not required for the PPU because the PPU only supports
static prediction for branches.
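
The static form can be tried on any GCC-compatible compiler. The sketch below is illustrative only: the UNLIKELY macro and the sum_nonnegative function are hypothetical names, and the example uses only a constant second argument because the variable (dynamic) form is the SPU enhancement described above and is not assumed to exist elsewhere.

```c
#include <stddef.h>

/* Fall back to the plain expression on compilers without the builtin. */
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(exp) __builtin_expect(!!(exp), 0)
#else
#define UNLIKELY(exp) (!!(exp))
#endif

/* Hypothetical helper: sums the values, counting and skipping the
 * negative ones, which the programmer expects to be rare. */
long sum_nonnegative(const int *values, size_t n, size_t *rejected)
{
    long total = 0;
    *rejected = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(values[i] < 0)) {
            (*rejected)++;      /* path predicted as not taken */
            continue;
        }
        total += values[i];     /* path predicted as taken */
    }
    return total;
}
```

The hint changes only how the branch is predicted and laid out; the result of the function is the same with or without it.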
1.8. Inline Assembly


Occasionally, a programmer might not be able to achieve the desired low-level programming result by using only
C/C++ language constructs and intrinsic functions. To handle these situations, the use of inline assembly might be
necessary, and therefore it must be provided. The inline assembly syntax must match the AT&T assembly
syntax implemented by GCC.


The .balignl directive may be used within the inline assembly to ensure the known alignment that is needed to
achieve effective dual-issue by the hardware.


Target Definitions 


To support the development of code that can be conditionally compiled for multiple targets, compilers must define
__SPU__ when code is being compiled for the SPU, and __PPU__ when code is being compiled for the PPU. As
an example, the following code supports misaligned quadword loads. The __SPU__ and __PPU__ defines are used
to conditionally select which code to use. The code that is selected will be different depending on the processor
target.





vector unsigned char load_qword_unaligned(vector unsigned char *ptr)
{
  vector unsigned char qw0, qw1, qw;
#ifdef __SPU__
  unsigned int shift;
#endif
  /* Load the two aligned quadwords that straddle the unaligned address. */
  qw0 = *ptr;
  qw1 = *(ptr+1);
#ifdef __SPU__
  shift = (unsigned int)(ptr) & 15;
  qw = spu_or(spu_slqwbyte(qw0, shift),
              spu_rlmaskqwbyte(qw1, (signed)(shift - 16)));
#elif defined(__PPU__) /* PPU */
  qw = vec_perm(qw0, qw1, vec_lvsl(0, ptr));
#else
  #error "This code can only be compiled for PPU or the SPU"
#endif
  return (qw);
}
2. SPU Low-Level Specific and Generic Intrinsics


This chapter describes the minimal set of basic intrinsics and built-ins that make the underlying Instruction Set 
Architecture (ISA) and Synergistic Processor Element (SPE) hardware accessible from the C programming 
language. There are three types of intrinsics: 


• Specific
• Generic
• Built-ins


Intrinsics may be implemented either internally within the compiler or as macros. However, if an intrinsic is 
implemented as a macro, restrictions apply with respect to vector literals being passed as arguments. For more 
details, see section “1.4.6. Vector Literals”. 


2.1. Specific Intrinsics 


Specific intrinsics are specific in the sense that they have a one-to-one mapping with a single SPU assembly 
instruction. All specific intrinsics are named using the SPU assembly instruction prefixed by the string si_. For 
example, the specific intrinsic that implements the stop assembly instruction is named si_stop.


A specific intrinsic exists for nearly every assembly instruction. However, the functionality provided by several of the 
assembly instructions is better provided by the C/C++ language; therefore, for these instructions no specific intrinsic 
has been provided. Table 2-9 describes the assembly instructions that have no corresponding specific intrinsic. 


Table 2-9: Assembly Instructions for Which No Specific Intrinsic Exists

Instruction Type | SPU Instructions
Branch instructions | br, bra, brsl, brasl, bi, bid, bie, bisl, bisld, bisle, brnz, brz, brhnz, brhz, biz, bizd, bize, binz, binzd, binze, bihz, bihzd, bihze, bihnz, bihnzd, and bihnze (excluding bisled, bisledd, bislede)
Branch Hint instructions | hbr, hbrp, hbra, and hbrr
Interrupt Return instructions | iret, iretd, irete





All specific intrinsics are accessible through generic intrinsics, except for the specific intrinsics shown in Table 2-10. 
The intrinsics that are not accessible fall into three categories: 


• Instructions that are generated using basic variable referencing (that is, using vector and scalar loads and
stores)

• Instructions that are used for immediate vector construction

• Instructions that have limited usefulness and are not expected to be used except in rare conditions


Table 2-10: Specific Intrinsics Not Accessible through Generic Intrinsics


Generate Controls for Sub-Quadword Insertion

si_cbd: Generate Controls for Byte Insertion (d-form)
An effective address is computed by adding the value in the signed 7-bit immediate imm to word element 0 of a.
The rightmost 4 bits of the effective address are used to determine the position of the addressed byte within a
quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a
byte (byte element 3) at the indicated position within a quadword. The pattern is returned in quadword d.
Usage: d = si_cbd(a, imm)
Assembly Mapping: CBD d, imm(a)

si_cbx: Generate Controls for Byte Insertion (x-form)
An effective address is computed by adding the value of word element 0 of a to word element 0 of b. The
rightmost 4 bits of the effective address are used to determine the position of the addressed byte within a
quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a
byte (byte element 3) at the indicated position within a quadword. The pattern is returned in quadword d.
Usage: d = si_cbx(a, b)
Assembly Mapping: CBX d, a, b

si_cdd: Generate Controls for Doubleword Insertion (d-form)
An effective address is computed by adding the value in the signed 7-bit immediate imm to word element 0 of a.
The rightmost 4 bits of the effective address are used to determine the position of the addressed doubleword
within a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to
insert a doubleword (doubleword element 0) at the indicated position within a quadword. The pattern is
returned in quadword d.
Usage: d = si_cdd(a, imm)
Assembly Mapping: CDD d, imm(a)

si_cdx: Generate Controls for Doubleword Insertion (x-form)
An effective address is computed by adding the value of word element 0 of a to word element 0 of b. The
rightmost 4 bits of the effective address are used to determine the position of the addressed doubleword within
a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert
a doubleword (doubleword element 0) at the indicated position within a quadword. The pattern is returned in
quadword d.
Usage: d = si_cdx(a, b)
Assembly Mapping: CDX d, a, b

si_chd: Generate Controls for Halfword Insertion (d-form)
An effective address is computed by adding the value in the signed 7-bit immediate imm to word element 0 of a.
The rightmost 4 bits of the effective address are used to determine the position of the addressed halfword within
a quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert
a halfword (halfword element 1) at the indicated position within a quadword. The pattern is returned in
quadword d.
Usage: d = si_chd(a, imm)
Assembly Mapping: CHD d, imm(a)

si_chx: Generate Controls for Halfword Insertion (x-form)
An effective address is computed by adding the value of word element 0 of a to word element 0 of b. The
rightmost 4 bits of the effective address are used to determine the position of the addressed halfword within a
quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a
halfword (halfword element 1) at the indicated position within a quadword. The pattern is returned in
quadword d.
Usage: d = si_chx(a, b)
Assembly Mapping: CHX d, a, b

si_cwd: Generate Controls for Word Insertion (d-form)
An effective address is computed by adding the value in the signed 7-bit immediate imm to word element 0 of a.
The rightmost 4 bits of the effective address are used to determine the position of the addressed word within a
quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a
word (word element 0) at the indicated position within a quadword. The pattern is returned in quadword d.
Usage: d = si_cwd(a, imm)
Assembly Mapping: CWD d, imm(a)

si_cwx: Generate Controls for Word Insertion (x-form)
An effective address is computed by adding the value of word element 0 of a to word element 0 of b. The
rightmost 4 bits of the effective address are used to determine the position of the addressed word within a
quadword. Based on the position, a pattern is generated that can be used with the si_shufb intrinsic to insert a
word (word element 0) at the indicated position within a quadword. The pattern is returned in quadword d.
Usage: d = si_cwx(a, b)
Assembly Mapping: CWX d, a, b

Constant Formation Intrinsics

si_il: Immediate Load Word
The 16-bit signed immediate value imm is sign extended to 32 bits and placed into each of the 4 word elements
of quadword d.
Usage: d = si_il(imm)
Assembly Mapping: IL d, imm

si_ila: Immediate Load Address
The 18-bit immediate value imm is placed in the rightmost bits of each of the 4 word elements of quadword d.
The upper 14 bits of each word are set to 0.
Usage: d = si_ila(imm)
Assembly Mapping: ILA d, imm

si_ilh: Immediate Load Halfword
The 16-bit signed immediate value imm is placed in each of the 8 halfword elements of quadword d.
Usage: d = si_ilh(imm)
Assembly Mapping: ILH d, imm

si_ilhu: Immediate Load Halfword Upper
The 16-bit signed immediate value imm is placed into the leftmost 16 bits of each of the 4 word elements of
quadword d. The rightmost 16 bits are set to 0.
Usage: d = si_ilhu(imm)
Assembly Mapping: ILHU d, imm

si_iohl: Immediate Or Halfword Lower
The 16-bit immediate value imm is prepended with zeros and ORed with each of the 4 word elements of
quadword a. The result is returned in quadword d.
Usage: d = si_iohl(a, imm)
Assembly Mapping: rt <--- a
                  IOHL rt, imm
                  d <--- rt

No Operation Intrinsics

si_lnop: No Operation (load)
A no-operation is performed on the load pipeline.
Usage: si_lnop()
Assembly Mapping: LNOP

si_nop: No Operation (execute)
A no-operation is performed on the execute pipeline.
Usage: si_nop()
Assembly Mapping: NOP rt¹

Memory Load and Store Intrinsics

si_lqa: Load Quadword (a-form)
An effective address is determined by the sign-extended 18-bit value imm, with the 4 least significant bits
forced to zero. The quadword at this effective address is returned in quadword d.
Usage: d = si_lqa(imm)
Assembly Mapping: LQA d, imm

si_lqd: Load Quadword (d-form)
An effective address is computed by zeroing the 4 least significant bits of the sign-extended 14-bit immediate
value imm, adding imm to word element 0 of quadword a, and forcing the 4 least significant bits of the result to
zero. The quadword at this effective address is then returned in quadword d.
Usage: d = si_lqd(a, imm)
Assembly Mapping: LQD d, imm(a)

si_lqr: Load Quadword Instruction Relative (a-form)
An effective address is computed by forcing the 2 least significant bits of the signed 18-bit immediate value
imm to zero, adding this value to the address of the instruction, and forcing the 4 least significant bits of the
result to zero. The quadword at this effective address is then returned in quadword d.
Usage: d = si_lqr(imm)
Assembly Mapping: LQR d, imm

si_lqx: Load Quadword (x-form)
An effective address is computed by adding word element 0 of quadword a to word element 0 of quadword b
and forcing the 4 least significant bits to zero. The quadword at this effective address is then returned in
quadword d.
Usage: d = si_lqx(a, b)
Assembly Mapping: LQX d, a, b

si_stqa: Store Quadword (a-form)
An effective address is determined by the sign-extended 18-bit value imm, with the 4 least significant bits
forced to zero. The quadword a is stored at this effective address.
Usage: si_stqa(a, imm)
Assembly Mapping: STQA a, imm

si_stqd: Store Quadword (d-form)
An effective address is computed by zeroing the 4 least significant bits of the sign-extended 14-bit immediate
value imm, adding imm to word element 0 of quadword b, and forcing the 4 least significant bits to zero. The
quadword a is then stored at this effective address.
Usage: si_stqd(a, b, imm)
Assembly Mapping: STQD a, imm(b)

si_stqr: Store Quadword Instruction Relative (a-form)
An effective address is computed by forcing the 2 least significant bits of the signed 18-bit immediate value
imm to zero, adding this value to the address of the instruction, and forcing the 4 least significant bits of the
result to zero. The quadword a is then stored at this effective address.
Usage: si_stqr(a, imm)
Assembly Mapping: STQR a, imm

si_stqx: Store Quadword (x-form)
An effective address is computed by adding word element 0 of quadword b to word element 0 of quadword c
and forcing the 4 least significant bits to zero. The quadword a is then stored at this effective address.
Usage: si_stqx(a, b, c)
Assembly Mapping: STQX a, b, c

Control Intrinsics

si_stopd: Stop and Signal with Dependencies
Execution of the SPU is stopped and a signal type of 0x3FFF is delivered after all register dependencies are
met. This intrinsic is considered volatile with respect to all instructions and will not be reordered with any
other instructions.
Usage: si_stopd(a, b, c)
Assembly Mapping: STOPD a, b, c


¹ The false target parameter rt is optimally chosen depending on the register usage of neighboring instructions.


Specific intrinsics accept only the following types of arguments: 


• Immediate literals, as an explicit constant expression or as a symbolic address
• Enumerations
• qword arguments


Arguments of other types must be cast to qword. 


For complete details on the specific instructions, see the Synergistic Processor Unit Instruction Set Architecture. 


2.1.1. Specific Casting Intrinsics 


When using specific intrinsics, it might be necessary to cast from scalar types to the qword data type, or from the 
qword data type to scalar types. Similar to casting between vector data types, specific cast intrinsics have no effect 
on an argument that is stored in a register. All specific casting intrinsics are of the following form: 


d = casting_intrinsic(a)


See Table 2-11 for additional details about the specific casting intrinsics. 
Table 2-11: Specific Casting Intrinsics

Casting Intrinsic | Return Type (d) | Argument Type (a) | Description
si_to_char | signed char | qword | Cast byte element 3 of qword a to signed char d.
si_to_uchar | unsigned char | qword | Cast byte element 3 of qword a to unsigned char d.
si_to_short | short | qword | Cast halfword element 1 of qword a to short d.
si_to_ushort | unsigned short | qword | Cast halfword element 1 of qword a to unsigned short d.
si_to_int | int | qword | Cast word element 0 of qword a to int d.
si_to_uint | unsigned int | qword | Cast word element 0 of qword a to unsigned int d.
si_to_ptr | void * | qword | Cast word element 0 of qword a to a void pointer d.
si_to_llong | long long | qword | Cast doubleword element 0 of qword a to long long d.
si_to_ullong | unsigned long long | qword | Cast doubleword element 0 of qword a to unsigned long long d.
si_to_float | float | qword | Cast word element 0 of qword a to float d.
si_to_double | double | qword | Cast doubleword element 0 of qword a to double d.
si_from_char | qword | signed char | Cast signed char a to byte element 3 of qword d.
si_from_uchar | qword | unsigned char | Cast unsigned char a to byte element 3 of qword d.
si_from_short | qword | short | Cast short a to halfword element 1 of qword d.
si_from_ushort | qword | unsigned short | Cast unsigned short a to halfword element 1 of qword d.
si_from_int | qword | int | Cast int a to word element 0 of qword d.
si_from_uint | qword | unsigned int | Cast unsigned int a to word element 0 of qword d.
si_from_ptr | qword | void * | Cast void pointer a to word element 0 of qword d.
si_from_llong | qword | long long | Cast long long a to doubleword element 0 of qword d.
si_from_ullong | qword | unsigned long long | Cast unsigned long long a to doubleword element 0 of qword d.
si_from_float | qword | float | Cast float a to word element 0 of qword d.
si_from_double | qword | double | Cast double a to doubleword element 0 of qword d.





Because the casting intrinsics do not perform data conversion, casting from a scalar type to a qword type results in 
portions of the quadword being undefined. 
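
The element positions named in Table 2-11 can be illustrated with a host-side model. The following is only a sketch in portable C, not SPU code: the model_qword type and the model_* functions are hypothetical stand-ins for the qword type and three of the casting intrinsics, and the 0xAA fill is an arbitrary choice made to mark the bytes that the text above calls undefined.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of a 128-bit register: 16 bytes, with
 * byte element 0 the most significant (big-endian element order). */
typedef struct { uint8_t byte[16]; } model_qword;

/* Model of si_from_int: only word element 0 (bytes 0..3) is written.
 * The other bytes are undefined on the SPU; the model marks them 0xAA. */
model_qword model_from_int(int32_t a)
{
    model_qword q;
    uint32_t u = (uint32_t)a;
    memset(q.byte, 0xAA, sizeof q.byte);
    q.byte[0] = (uint8_t)(u >> 24);
    q.byte[1] = (uint8_t)(u >> 16);
    q.byte[2] = (uint8_t)(u >> 8);
    q.byte[3] = (uint8_t)u;
    return q;
}

/* Model of si_to_int: reads word element 0 (bytes 0..3). */
int32_t model_to_int(model_qword q)
{
    uint32_t u = ((uint32_t)q.byte[0] << 24) | ((uint32_t)q.byte[1] << 16) |
                 ((uint32_t)q.byte[2] << 8)  |  (uint32_t)q.byte[3];
    return (int32_t)u;
}

/* Model of si_to_uchar: reads byte element 3. */
uint8_t model_to_uchar(model_qword q)
{
    return q.byte[3];
}
```

In this model, byte element 3 is the least significant byte of word element 0, so model_to_uchar applied to the result of model_from_int returns the low byte of the integer.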


2.2. Generic Intrinsics and Built-ins 


Generic intrinsics are operations that map to one or more specific intrinsics. The mapping of a generic intrinsic to a 
specific intrinsic depends on the input arguments to the intrinsic. Built-ins are similar to generic intrinsics; however, 
unlike generic intrinsics, built-ins map to more than one SPU instruction. All generic intrinsics and built-ins are 
prefixed by the string spu_. For example, the generic intrinsic that implements the stop assembly instruction is
named spu_stop.


2.2.1. Mapping Intrinsics with Scalar Operands 


Intrinsics with scalar arguments are introduced for SPU instructions with immediate fields. For example, the intrinsic 
function vector signed int spu_add(vector signed int, int) will translate to an AI assembly instruction. 


Depending on the assembly instruction, immediate values are either 7, 10, 16, or 18 bits in length. The action 
performed for out-of-range immediate values depends on the type of intrinsic. By default, immediate-form specific 
intrinsics with an out-of-range immediate value are flagged as an error. Compilers may provide an option to issue a 
warning for out-of-range immediate values and use only the specified number of least significant bits for the 
out-of-range argument. 


Generic intrinsics support a full range of scalar operands. This support is not dependent on whether the scalar 
operand can be represented within the instruction’s immediate field. Consider the following example: 


C/C++ Language Extensions for Cell Broadband Engine™ Architecture, Version 2.3 




d = spu_and(vector unsigned int a, int b);

Depending on argument b, different instructions are generated:


• If b is a literal constant within the range supported by one of the immediate forms, the immediate instruction
form is generated. For example, if b equals 1, then ANDI d, a, 1 is generated.


• If b is a literal constant and is out-of-range but can be folded and implemented using an alternate immediate
instruction form, the alternate immediate instruction is generated. For example, if b equals 0x30003, then
ANDHI d, a, 3 is generated. In this context, "alternate immediate instruction form" means an immediate
instruction form having a smaller data element size.


• If b is a literal constant that can be constructed using one or two immediate load instructions followed by the
non-immediate form of the instruction, the appropriate instructions will be used. Immediate load instructions
include IL, ILH, ILHU, ILA, IOHL, and FSMBI. Table 2-12 shows possible uses of the immediate load
instructions for various constants b.





Table 2-12: Possible Uses of Immediate Load Instructions for Various Values of Constant b

Constant b | Generates Instructions
-6000 | IL b, -6000; AND d, a, b
131074 (0x20002) | ILH b, 2; AND d, a, b
131072 (0x20000) | ILHU b, 2; AND d, a, b
134000 (0x20B70) | ILA b, 134000; AND d, a, b
262780 (0x4027C) | ILHU b, 4; IOHL b, 636; AND d, a, b
(0xFFFFFFFF, 0x0, 0x0, 0xFFFFFFFF) | FSMBI b, 0xF00F; AND d, a, b





• If b is a variable (non-literal) integer, code to splat the integer across the entire vector is generated followed
by the non-immediate form of the instruction. For example, if b is an integer of unknown value, the constant
area is loaded with the shuffle pattern (0x10203, 0x10203, 0x10203, 0x10203) at "CONST_AREA,
offset" and the following instructions are generated:


LQD pattern, CONST_AREA, offset
SHUFB b, b, b, pattern
AND d, a, b
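

The choice among the immediate load instructions for a literal 32-bit constant can be sketched as a small classifier. This is not the algorithm of any particular compiler: the load_form enumeration, the choose_load_form function, and the order of the checks are assumptions chosen so that the scalar rows of Table 2-12 are reproduced. The FSMBI row is a per-word vector constant and is outside the scope of this sketch.

```c
#include <stdint.h>

/* Hypothetical labels for the instruction sequences in Table 2-12. */
typedef enum {
    LOAD_IL,         /* fits a sign-extended 16-bit immediate       */
    LOAD_ILH,        /* upper and lower halfwords are identical     */
    LOAD_ILHU,       /* lower halfword is zero                      */
    LOAD_ILA,        /* fits an unsigned 18-bit immediate           */
    LOAD_ILHU_IOHL   /* general case: two immediate loads           */
} load_form;

/* Assumed selection order: IL, ILH, ILHU, ILA, then ILHU + IOHL. */
load_form choose_load_form(int32_t b)
{
    uint32_t u  = (uint32_t)b;
    uint32_t hi = u >> 16;
    uint32_t lo = u & 0xFFFFu;

    if (b >= -32768 && b <= 32767) return LOAD_IL;
    if (hi == lo)                  return LOAD_ILH;
    if (lo == 0)                   return LOAD_ILHU;
    if (u <= 0x3FFFFu)             return LOAD_ILA;
    return LOAD_ILHU_IOHL;
}
```

For example, 262780 (0x4027C) exceeds the 18-bit range of ILA, so the sketch falls through to the two-instruction ILHU and IOHL sequence, matching the table.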





2.2.2. Implicit Conversion of Arguments of Intrinsics 


There is no implicit conversion of arguments which have a vector type. Arguments of scalar type are converted 
according to the rules specified in the C/C++ standards. Consider, for example, 


d = spu_insert(a, b, element); 


Scalar a is inserted into the element of vector b that is specified by the element parameter. When b is a vector
double, a must be converted to double, element must be converted to int, and d must be a vector double.


2.2.3. Notations and Conventions 
The remaining documentation describing the generic intrinsics uses the following rules and naming conventions: 


• The table associated with each generic intrinsic specifies the supported input types.


• For intrinsics with scalar operands, only the immediate form of the instruction is shown. The other forms can
be deduced in accordance with the rules discussed in section "2.2.1. Mapping Intrinsics with Scalar
Operands".


• Some intrinsics, whether specific or generic, map to assembly instructions that do not uniquely specify all
input and output registers. Instead, an input register also serves as the output register. Examples of these
assembly instructions include ADDX, DFMS, MPYHHA, and SFX. For these intrinsics, the notation rt <--- c
is used to imply that a register-to-register copy (copy c to rt) might be required to satisfy the semantics of
the intrinsic, depending on the inputs and outputs. No copies will be generated if input c is the same as
output d.


• Generic intrinsics that do not map to specific intrinsics are identified by the acronym "N/A" (not applicable) in
the Specific Intrinsics column of the respective table.


2.3. Constant Formation Intrinsics 


spu_splats: Splat Scalar To a Vector 


d = spu_splats(a) 


A single scalar value is replicated across all elements of a vector of the same type. The result is returned in vector d. 


Table 2-13: Splat Scalar To a Vector

Non-literal arguments. Specific Intrinsics: N/A. Assembly Mapping: SHUFB d, a, a, pattern

Return Type (d) | Argument Type (a)
vector unsigned char | unsigned char
vector signed char | signed char
vector unsigned short | unsigned short
vector signed short | signed short
vector unsigned int | unsigned int
vector signed int | signed int
vector unsigned long long | unsigned long long
vector signed long long | signed long long
vector float | float
vector double | double

Literal arguments. Specific Intrinsics: N/A. Assembly Mapping: one of the following
  IL d, a
  or ILA d, a
  or ILH d, a
  or ILHU d, a>>16
  or ILHU d, a>>16; IOHL d, a
  or FSMBI d, a

Return Type (d) | Argument Type (a)
vector unsigned char | unsigned char (literal)
vector signed char | signed char (literal)
vector unsigned short | unsigned short (literal)
vector signed short | signed short (literal)
vector unsigned int | unsigned int (literal)
vector signed int | signed int (literal)
vector unsigned long long | unsigned long long (literal)
vector signed long long | signed long long (literal)
vector float | float (literal)
vector double | double (literal)
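
The SHUFB mapping for a non-literal 32-bit scalar can be modelled on a host machine. The sketch below is an illustration only: model_qword, model_shufb, and model_splats_word are hypothetical names, and the shuffle model covers only pattern bytes that select a source byte, leaving out the special pattern encodings that produce constant bytes.

```c
#include <stdint.h>

/* Hypothetical host-side model of a 128-bit register (byte 0 first). */
typedef struct { uint8_t byte[16]; } model_qword;

/* Simplified shuffle model: each pattern byte selects one byte of the
 * 32-byte concatenation of a and b. Special encodings are not modelled. */
model_qword model_shufb(model_qword a, model_qword b, model_qword pattern)
{
    model_qword d;
    for (int i = 0; i < 16; i++) {
        unsigned sel = pattern.byte[i] & 0x1Fu;
        d.byte[i] = (sel < 16) ? a.byte[sel] : b.byte[sel - 16];
    }
    return d;
}

/* Splat a 32-bit scalar held in word element 0 across all four words,
 * using the byte pattern 00 01 02 03 repeated four times. */
model_qword model_splats_word(uint32_t a)
{
    model_qword src = {{0}};
    model_qword pattern;
    src.byte[0] = (uint8_t)(a >> 24);
    src.byte[1] = (uint8_t)(a >> 16);
    src.byte[2] = (uint8_t)(a >> 8);
    src.byte[3] = (uint8_t)a;
    for (int i = 0; i < 16; i++) {
        pattern.byte[i] = (uint8_t)(i & 3);
    }
    return model_shufb(src, src, pattern);
}
```

Every word of the result repeats bytes 0 through 3 of the source, which is the replication that spu_splats describes for a 32-bit element.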
2.4. Conversion Intrinsics


spu_convtf: Convert Vector To Float


d = spu_convtf(a, scale)


Each element of vector a is converted to a floating-point value and divided by 2^scale. The allowable range for
scale is 0 to 127. Values outside this range are flagged as an error and compilation is terminated. The result is
returned in vector d.


Table 2-14: Convert an Integer Vector To a Vector Float

Return Type (d) | Argument a | Argument scale | Specific Intrinsics | Assembly Mapping
vector float | vector unsigned int | unsigned int (7-bit literal) | d = si_cuflt(a, scale) | CUFLT d, a, scale
vector float | vector signed int | unsigned int (7-bit literal) | d = si_csflt(a, scale) | CSFLT d, a, scale
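
The effect of the scale argument on a single element can be sketched in portable C. The function model_convtf_s32 is a hypothetical name for a host-side model of the signed case; the rounding of the integer-to-float cast is that of the host compiler and is not claimed to match the SPU.

```c
#include <stdint.h>

/* Hypothetical scalar model of one element of spu_convtf for a signed
 * input: convert to float, then divide by 2^scale (scale in 0..127).
 * Halving is exact for powers of two, barring underflow. */
float model_convtf_s32(int32_t a, unsigned scale)
{
    float r = (float)a;
    while (scale-- > 0) {
        r *= 0.5f;
    }
    return r;
}
```

With scale set to 8, for instance, the integer 256 becomes 1.0, which is how a fixed-point value with 8 fractional bits is turned into a float.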





spu_convts: Convert Floating-Point Vector To Signed Integer Vector 


d = spu_convts(a, scale) 


Each element of vector a is scaled by 2^scale and the result is converted to a signed integer. If the intermediate
result is greater than 2^31 - 1, the result saturates to 2^31 - 1. If the intermediate value is less than -2^31, the
result saturates to -2^31. The allowable range for scale is 0 to 127. Values outside this range are flagged as an
error and compilation is terminated. The results are returned in the corresponding elements of vector d.


Table 2-15: Convert a Vector Float To a Signed Integer Vector

Return Type (d) | Argument a | Argument scale | Specific Intrinsics | Assembly Mapping
vector signed int | vector float | unsigned int (7-bit literal) | d = si_cflts(a, scale) | CFLTS d, a, scale





spu_convtu: Convert Floating-Point Vector To Unsigned Integer Vector 


d = spu_convtu(a, scale) 

Each element of vector a is scaled by 2^scale and the result is converted to an unsigned integer. If the intermediate
result is greater than 2^32 - 1, the result saturates to 2^32 - 1. If the intermediate value is negative, the result
saturates to zero. The allowable range for scale is 0 to 127. Values outside this range are flagged as an error and
compilation is terminated; otherwise, the result is returned in the corresponding element of vector d.


Table 2-16: Convert a Vector Float To an Unsigned Integer Vector

Return Type (d) | Argument a | Argument scale | Specific Intrinsics | Assembly Mapping
vector unsigned int | vector float | unsigned int (7-bit literal) | d = si_cfltu(a, scale) | CFLTU d, a, scale
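
The saturation rules for spu_convts and spu_convtu can be sketched per element in portable C. The names model_scale_up, model_convts, and model_convtu are hypothetical. Two points are assumptions of this model rather than statements from the text above: the final cast truncates toward zero, and a NaN input is mapped to 0 so that the host code stays well defined.

```c
#include <stdint.h>

/* Multiply by 2^scale exactly, using double as the intermediate type. */
static double model_scale_up(float a, unsigned scale)
{
    double r = (double)a;
    while (scale-- > 0) {
        r *= 2.0;
    }
    return r;
}

/* Hypothetical model of one element of spu_convts. */
int32_t model_convts(float a, unsigned scale)
{
    double r = model_scale_up(a, scale);
    if (r != r)            return 0;          /* NaN: model-only choice */
    if (r >= 2147483647.0) return INT32_MAX;  /* saturate at 2^31 - 1   */
    if (r <= -2147483648.0) return INT32_MIN; /* saturate at -2^31      */
    return (int32_t)r;                        /* assumed truncation     */
}

/* Hypothetical model of one element of spu_convtu. */
uint32_t model_convtu(float a, unsigned scale)
{
    double r = model_scale_up(a, scale);
    if (r != r)            return 0;          /* NaN: model-only choice */
    if (r >= 4294967295.0) return UINT32_MAX; /* saturate at 2^32 - 1   */
    if (r <= 0.0)          return 0;          /* negative saturates to 0 */
    return (uint32_t)r;                       /* assumed truncation     */
}
```

The scale argument works in the opposite direction to spu_convtf: a scale of 2 turns 0.75 into the integer 3.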





spu_extend: Sign Extend Vector 


d = spu_extend(a) 


For a fixed-point vector a, each odd element of vector a is sign extended and returned in the corresponding element
of vector d. For a floating-point vector, each even element of a is sign extended and returned in the corresponding
element of d.


Table 2-17: Sign Extend Vector

Return Type (d) | Argument Type (a) | Specific Intrinsics | Assembly Mapping
vector signed short | vector signed char | d = si_xsbh(a) | XSBH d, a
vector signed int | vector signed short | d = si_xshw(a) | XSHW d, a
vector signed long long | vector signed int | d = si_xswd(a) | XSWD d, a
vector double | vector float | d = si_fesd(a) | FESD d, a
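
The phrase "each odd element" can be made concrete with a host-side model of the signed char to signed short case. The function model_extend_s8_to_s16 is a hypothetical name, and plain arrays stand in for the vector types.

```c
#include <stdint.h>

/* Hypothetical model of spu_extend for vector signed char input:
 * in[16] holds the byte elements and out[8] the halfword elements.
 * Halfword i receives the sign-extended odd byte element 2*i + 1. */
void model_extend_s8_to_s16(const int8_t in[16], int16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)in[2 * i + 1];
    }
}
```

The even byte elements of the input do not contribute to the result in this model.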





spu_roundtf: Round Vector Double To Vector Float

d = spu_roundtf(a)

Each doubleword element of vector a is rounded to a single-precision floating-point value and placed in the even
element of vector d. Zeros are placed in the odd elements of d.


Table 2-18: Round a Vector Double To a Float

Return Type (d) | Argument Type (a) | Specific Intrinsics | Assembly Mapping
vector float | vector double | d = si_frds(a) | FRDS d, a
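
The element placement of spu_roundtf can be sketched with plain arrays. The function model_roundtf is a hypothetical host-side model; the rounding performed by the cast is that of the host and is not claimed to match the SPU rounding mode.

```c
/* Hypothetical model of spu_roundtf: a[2] holds the doubleword
 * elements, d[4] the word elements of the result. Rounded values go
 * to the even elements; the odd elements are set to zero. */
void model_roundtf(const double a[2], float d[4])
{
    d[0] = (float)a[0];
    d[1] = 0.0f;
    d[2] = (float)a[1];
    d[3] = 0.0f;
}
```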





2.5. Arithmetic Intrinsics 


spu_add: Vector Add

d = spu_add(a, b)


Each element of vector a is added to the corresponding element of vector b. If b is a scalar, the scalar value is 
replicated for each element and then added to a. Overflows and carries are not detected, and no saturation is 
performed. The results are returned in the corresponding elements of vector d. 


Table 2-19: Vector Add

Return Type (d) | Argument a | Argument b | Specific Intrinsics | Assembly Mapping
vector signed int | vector signed int | vector signed int | d = si_a(a, b) | A d, a, b
vector unsigned int | vector unsigned int | vector unsigned int | d = si_a(a, b) | A d, a, b
vector signed short | vector signed short | vector signed short | d = si_ah(a, b) | AH d, a, b
vector unsigned short | vector unsigned short | vector unsigned short | d = si_ah(a, b) | AH d, a, b
vector signed int | vector signed int | 10-bit signed int (literal) | d = si_ai(a, b) | AI d, a, b
vector unsigned int | vector unsigned int | 10-bit signed int (literal) | d = si_ai(a, b) | AI d, a, b
vector signed int | vector signed int | int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |
vector unsigned int | vector unsigned int | unsigned int | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |
vector signed short | vector signed short | 10-bit signed short (literal) | d = si_ahi(a, b) | AHI d, a, b
vector unsigned short | vector unsigned short | 10-bit signed short (literal) | d = si_ahi(a, b) | AHI d, a, b
vector signed short | vector signed short | short | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |
vector unsigned short | vector unsigned short | unsigned short | See section "2.2.1. Mapping Intrinsics with Scalar Operands". |
vector float | vector float | vector float | d = si_fa(a, b) | FA d, a, b
vector double | vector double | vector double | d = si_dfa(a, b) | DFA d, a, b
spu_addx: Vector Add Extended


d = spu_addx(a, b, c) 


Each element of vector a is added to the corresponding element of vector b and to the least significant bit of the 
corresponding element of vector c. The result is returned in the corresponding element of vector d. 


Table 2-20: Vector Add Extended

Return Type (d) | Argument a | Argument b | Argument c | Specific Intrinsics | Assembly Mapping
vector signed int | vector signed int | vector signed int | vector signed int | d = si_addx(a, b, c) | rt <--- c; ADDX rt, a, b; d <--- rt
vector unsigned int | vector unsigned int | vector unsigned int | vector unsigned int | d = si_addx(a, b, c) | rt <--- c; ADDX rt, a, b; d <--- rt





spu_genb: Vector Generate Borrow 


d = spu_genb (a, b) 


Each element of vector b is subtracted from the corresponding element of vector a. The resulting borrow out is 
placed in the least significant bit of the corresponding element of vector d. The remaining bits of d are set to 0. 


Table 2-21: Vector Generate Borrow

Return Type (d) | Argument a | Argument b | Specific Intrinsics | Assembly Mapping
vector signed int | vector signed int | vector signed int | d = si_bg(b, a) | BG rt, b, a
vector unsigned int | vector unsigned int | vector unsigned int | d = si_bg(b, a) | BG rt, b, a





spu_genbx: Vector Generate Borrow Extended 


d = spu_genbx (a, b, c) 


Each element of vector b is subtracted from the corresponding element of vector a. An additional 1 is subtracted
from the result if the least significant bit of the corresponding element of vector c is 0. If the result is less than 0, a 1
is placed in the corresponding element of vector d; otherwise, a 0 is placed in the corresponding element of d.


Table 2-22: Vector Generate Borrow Extended 











Return/Argument Types                                                                     Specific Intrinsics      Assembly Mapping
d                     a                     b                     c
vector signed int     vector signed int     vector signed int     vector signed int      d = si_bgx(b, a, c)      rt <--- c
vector unsigned int   vector unsigned int   vector unsigned int   vector unsigned int                             BGX rt, b, a
                                                                                                                  d <--- rt





spu_genc: Vector Generate Carry 


d = spu_genc (a, b) 


Each element of vector a is added to the corresponding element of vector b. The resulting carry out is placed in the 
least significant bit of the corresponding element of vector d. The remaining bits of d are set to 0. 


Table 2-23: Vector Generate Carry 











Return/Argument Types                                              Specific Intrinsics    Assembly Mapping
d                     a                     b
vector signed int     vector signed int     vector signed int      d = si_cg(a, b)        CG rt, a, b
vector unsigned int   vector unsigned int   vector unsigned int







spu_gencx: Vector Generate Carry Extended 


d = spu_gencx(a, b, c) 


Each element of vector a is added to the corresponding element of vector b and the least significant bit of the 
corresponding element of vector c. The resulting carry out is placed in the least significant bit of the corresponding 
element of vector d. The remaining bits of d are set to 0. 
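The carry intrinsics are typically chained to build arithmetic wider than 32 bits. The following portable C sketch models one 32-bit element of spu_genc, spu_gencx, and spu_addx as described in this section, and uses them to add two 96-bit numbers. The function names (genc32, gencx32, addx32, add96) and the limb layout are assumptions made for this illustration only; they are not part of the specification, and on an SPU the carry words would also have to be moved between element positions.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar models of a single 32-bit element; all names are hypothetical. */
static uint32_t genc32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) >> 32);            /* carry out of a + b */
}

static uint32_t gencx32(uint32_t a, uint32_t b, uint32_t c)
{
    return (uint32_t)(((uint64_t)a + b + (c & 1u)) >> 32); /* carry out of a + b + LSB(c) */
}

static uint32_t addx32(uint32_t a, uint32_t b, uint32_t c)
{
    return a + b + (c & 1u);                               /* wraps modulo 2^32 */
}

/* Adds two 96-bit numbers held as three 32-bit limbs, least significant limb first. */
void add96(const uint32_t a[3], const uint32_t b[3], uint32_t sum[3])
{
    uint32_t c0 = genc32(a[0], b[0]);        /* carry from limb 0 into limb 1 */
    uint32_t c1 = gencx32(a[1], b[1], c0);   /* carry from limb 1 into limb 2 */
    sum[0] = a[0] + b[0];
    sum[1] = addx32(a[1], b[1], c0);
    sum[2] = addx32(a[2], b[2], c1);
}

/* Usage example with hand-checked values: the carry ripples through two limbs. */
void add96_example(void)
{
    const uint32_t a[3] = { 0xFFFFFFFFu, 0xFFFFFFFFu, 0x00000001u };
    const uint32_t b[3] = { 0x00000001u, 0x00000000u, 0x00000002u };
    uint32_t s[3];
    add96(a, b, s);
    assert(s[0] == 0u && s[1] == 0u && s[2] == 4u);
}
```

The extended forms exist so that each limb can consume the carry produced by the limb below it without a separate add.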


Table 2-24: Vector Generate Carry Extended 











Return/Argument Types                                                                     Specific Intrinsics      Assembly Mapping
d                     a                     b                     c
vector signed int     vector signed int     vector signed int     vector signed int      d = si_cgx(a, b, c)      rt <--- c
vector unsigned int   vector unsigned int   vector unsigned int   vector unsigned int                             CGX rt, a, b
                                                                                                                  d <--- rt





spu_madd: Vector Multiply and Add 


d = spu_madd(a, b, c) 


Each element of vector a is multiplied by the corresponding element of vector b, added to the corresponding element
of vector c, and returned in the corresponding element of vector d. For integer multiply-and-adds, the odd elements of
vectors a and b are sign extended to 32-bit integers prior to multiplication.
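To make the integer case concrete, the following portable C sketch emulates the integer form of spu_madd on plain arrays. It assumes that array index 0 corresponds to vector element 0, so the odd elements are indices 1, 3, 5, and 7. The function name emu_madd_int is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates the integer form of d = spu_madd(a, b, c).
   a and b hold eight 16-bit elements; c and d hold four 32-bit elements.
   Only the odd-numbered halfwords (indices 1, 3, 5, 7) take part. */
void emu_madd_int(const int16_t a[8], const int16_t b[8],
                  const int32_t c[4], int32_t d[4])
{
    for (int i = 0; i < 4; i++) {
        /* sign extend both halfwords to 32 bits, then multiply */
        int32_t prod = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        /* add through uint32_t so that wraparound is well defined in C */
        d[i] = (int32_t)((uint32_t)prod + (uint32_t)c[i]);
    }
}

/* Usage example: the even halfwords (0 and 9 here) are ignored. */
void emu_madd_int_example(void)
{
    const int16_t a[8] = { 0, -3, 0, 4, 0, 100, 0, -1 };
    const int16_t b[8] = { 9,  5, 9, 6, 9,  -7, 9, -1 };
    const int32_t c[4] = { 1, 2, 3, 4 };
    int32_t d[4];
    emu_madd_int(a, b, c, d);
    assert(d[0] == -14 && d[1] == 26 && d[2] == -697 && d[3] == 5);
}
```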


Table 2-25: Vector Multiply and Add 











Return/Argument Types                                                                     Specific Intrinsics      Assembly Mapping
d                     a                     b                     c
vector signed int     vector signed short   vector signed short   vector signed int      d = si_mpya(a, b, c)     MPYA d, a, b, c
vector float          vector float          vector float          vector float           d = si_fma(a, b, c)      FMA d, a, b, c
vector double         vector double         vector double         vector double          d = si_dfma(a, b, c)     rt <--- c
                                                                                                                  DFMA rt, a, b
                                                                                                                  d <--- rt





spu_mhhadd: Vector Multiply High High and Add 
d = spu_mhhadd(a, b, c) 


Each even element of vector a is multiplied by the corresponding even element of vector b, and the 32-bit result is 
added to the corresponding element of vector c and returned in the corresponding element of vector d. 


Table 2-26: Vector Multiply High High and Add 











Return/Argument Types                                                                         Specific Intrinsics         Assembly Mapping
d                     a                       b                       c
vector signed int     vector signed short     vector signed short     vector signed int      d = si_mpyhha(a, b, c)      rt <--- c
                                                                                                                         MPYHHA rt, a, b
                                                                                                                         d <--- rt
vector unsigned int   vector unsigned short   vector unsigned short   vector unsigned int    d = si_mpyhhau(a, b, c)     rt <--- c
                                                                                                                         MPYHHAU rt, a, b
                                                                                                                         d <--- rt
spu_msub: Vector Multiply and Subtract
d = spu_msub (a, b, c) 


Each element of vector a is multiplied by the corresponding element of vector b, and the corresponding element of 
vector c is subtracted from the product. The result is returned in the corresponding element of vector d. 


Table 2-27: Vector Multiply and Subtract 











Return/Argument Types                                              Specific Intrinsics      Assembly Mapping
d               a               b               c
vector float    vector float    vector float    vector float       d = si_fms(a, b, c)      FMS d, a, b, c
vector double   vector double   vector double   vector double      d = si_dfms(a, b, c)     rt <--- c
                                                                                            DFMS rt, a, b
                                                                                            d <--- rt





spu_mul: Vector Multiply 
d = spu_mul(a, b)


Each element of vector a is multiplied by the corresponding element of vector b and returned in the corresponding 
element of vector d. 


Table 2-28: Vector Multiply 











Return/Argument Types                              Specific Intrinsics    Assembly Mapping
d               a               b
vector float    vector float    vector float       d = si_fm(a, b)        FM d, a, b
vector double   vector double   vector double      d = si_dfm(a, b)       DFM d, a, b





spu_mulh: Vector Multiply High 
d = spu_mulh(a, b) 


Each even element of vector a is multiplied by the next (odd) element of vector b. The product is shifted left by 16 
bits and stored in the corresponding element of vector d. Bits shifted out at the left are discarded. Zeros are shifted 
in at the right. 
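The following portable C sketch emulates the description above on plain arrays. It assumes that array index 0 corresponds to vector element 0, so the even element 2i of a is paired with the odd element 2i+1 of b. The name emu_mulh is hypothetical; the sketch illustrates the stated semantics and is not the SPU implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates d = spu_mulh(a, b): d[i] = (a[2i] * b[2i+1]) << 16, keeping 32 bits. */
void emu_mulh(const int16_t a[8], const int16_t b[8], int32_t d[4])
{
    for (int i = 0; i < 4; i++) {
        /* multiply, then shift in unsigned arithmetic so discarded bits are well defined */
        uint32_t prod = (uint32_t)((int32_t)a[2 * i] * (int32_t)b[2 * i + 1]);
        d[i] = (int32_t)(uint32_t)(prod << 16);
    }
}

/* Usage example with hand-checked values. */
void emu_mulh_example(void)
{
    const int16_t a[8] = { 3, 0, -1, 0, 256, 0, 2, 0 };
    const int16_t b[8] = { 0, 5, 0, 1, 0, 256, 0, -3 };
    int32_t d[4];
    emu_mulh(a, b, d);
    assert((uint32_t)d[0] == 0x000F0000u);  /* 15 << 16 */
    assert((uint32_t)d[1] == 0xFFFF0000u);  /* -1 << 16 */
    assert(d[2] == 0);                      /* 65536 << 16 overflows 32 bits */
    assert((uint32_t)d[3] == 0xFFFA0000u);  /* -6 << 16 */
}
```

Because only the low 16 bits of each product survive the shift, the result does not depend on whether the halfwords are treated as signed or unsigned.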


Table 2-29: Vector Multiply High 











Return/Argument Types                                              Specific Intrinsics    Assembly Mapping
d                   a                     b
vector signed int   vector signed short   vector signed short      d = si_mpyh(a, b)      MPYH d, a, b





spu_mule: Vector Multiply Even 


d = spu_mule(a, b)


Each even element of vector a is multiplied by the corresponding even element of vector b, and the 32-bit result is
returned in the corresponding element of vector d.


Table 2-30: Vector Multiply Even 











Return/Argument Types                                                    Specific Intrinsics      Assembly Mapping
d                     a                       b
vector signed int     vector signed short     vector signed short        d = si_mpyhh(a, b)       MPYHH d, a, b
vector unsigned int   vector unsigned short   vector unsigned short      d = si_mpyhhu(a, b)      MPYHHU d, a, b
spu_mulo: Vector Multiply Odd
d = spu_mulo(a, b)


Each odd element of vector a is multiplied by the corresponding element of vector b. If b is a scalar, the
scalar value is replicated for each element and then multiplied by a. The results are returned in vector d.


Table 2-31: Vector Multiply Odd 





Return/Argument Types                                                            Specific Intrinsics     Assembly Mapping
d                     a                       b
vector signed int     vector signed short     vector signed short                d = si_mpy(a, b)        MPY d, a, b
                                              10-bit signed short (literal)      d = si_mpyi(a, b)       MPYI d, a, b
                                              signed short                       See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector unsigned int   vector unsigned short   vector unsigned short              d = si_mpyu(a, b)       MPYU d, a, b
                                              10-bit signed short (literal)      d = si_mpyui(a, b)      MPYUI d, a, b
                                              unsigned short                     See section “2.2.1. Mapping Intrinsics with Scalar Operands”.





spu_mulsr: Vector Multiply and Shift Right 


d = spu_mulsr (a, b) 


Each odd element of vector a is multiplied by the corresponding odd element of vector b. The leftmost 16 bits of the
resulting 32-bit product are sign extended and returned in the corresponding 32-bit element of vector d.


Table 2-32: Vector Multiply and Shift Right 











Return/Argument Types                                              Specific Intrinsics    Assembly Mapping
d                   a                     b
vector signed int   vector signed short   vector signed short      d = si_mpys(a, b)      MPYS d, a, b





spu_nmadd: Negative Vector Multiply and Add 
d = spu_nmadd(a, b, c) 


Each element of vector a is multiplied by the corresponding element in vector b and then added to the 
corresponding element of vector c. The result is negated and returned in the corresponding element of vector d. 


Table 2-33: Negative Vector Multiply and Add 











Return/Argument Types                                              Specific Intrinsics       Assembly Mapping
d               a               b               c
vector double   vector double   vector double   vector double      d = si_dfnma(a, b, c)     rt <--- c
                                                                                             DFNMA rt, a, b
                                                                                             d <--- rt
spu_nmsub: Negative Vector Multiply and Subtract


d = spu_nmsub (a, b, c) 


Each element of vector a is multiplied by the corresponding element in vector b. The result is subtracted from the 
corresponding element in c and returned in the corresponding element of vector d. 


Table 2-34: Negative Vector Multiply and Subtract 











Return/Argument Types                                              Specific Intrinsics       Assembly Mapping
d               a               b               c
vector float    vector float    vector float    vector float       d = si_fnms(a, b, c)      FNMS d, a, b, c
vector double   vector double   vector double   vector double      d = si_dfnms(a, b, c)     rt <--- c
                                                                                             DFNMS rt, a, b
                                                                                             d <--- rt





spu_re: Vector Floating-Point Reciprocal Estimate 


d = spu_re(a) 


For each element of vector a, an estimate of its floating-point reciprocal is computed, and the result is returned in 
the corresponding element of vector d. The resulting estimate is accurate to 12 bits. 


Table 2-35: Vector Floating-Point Reciprocal Estimate 











Return/Argument Types           Specific Intrinsics    Assembly Mapping
d              a
vector float   vector float     t = si_frest(a)        FREST d, a
                                d = si_fi(a, t)        FI d, a, d





spu_rsqrte: Vector Floating-Point Reciprocal Square Root Estimate 


d = spu_rsqrte(a)


For each element of vector a, an estimate of its floating-point reciprocal square root is computed, and the result is 
returned in the corresponding element of vector d. The resulting estimate is accurate to 12 bits. 


Table 2-36: Vector Floating-Point Reciprocal Square Root Estimate 











Return/Argument Types           Specific Intrinsics     Assembly Mapping
d              a
vector float   vector float     t = si_frsqest(a)       FRSQEST d, a
                                d = si_fi(a, t)         FI d, a, d





spu_sub: Vector Subtract 


d = spu_sub (a, b) 


Each element of vector b is subtracted from the corresponding element of vector a. If a is a scalar, the scalar value 
is replicated for each element of a, and then b is subtracted from the corresponding element of a. Overflows and 
carries are not detected. The results are returned in the corresponding elements of vector d. 


Table 2-37: Vector Subtract 











Return/Argument Types                                                                  Specific Intrinsics    Assembly Mapping
d                       a                               b
vector signed short     vector signed short             vector signed short            d = si_sfh(b, a)       SFH d, b, a
vector unsigned short   vector unsigned short           vector unsigned short
vector signed int       vector signed int               vector signed int              d = si_sf(b, a)        SF d, b, a
vector unsigned int     vector unsigned int             vector unsigned int
vector signed int       10-bit signed int (literal)     vector signed int              d = si_sfi(b, a)       SFI d, b, a
vector unsigned int                                     vector unsigned int
vector signed int       int                             vector signed int              See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector unsigned int     unsigned int                    vector unsigned int
vector signed short     10-bit signed short (literal)   vector signed short            d = si_sfhi(b, a)      SFHI d, b, a
vector unsigned short                                   vector unsigned short
vector signed short     short                           vector signed short            See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector unsigned short   unsigned short                  vector unsigned short
vector float            vector float                    vector float                   d = si_fs(a, b)        FS d, a, b
vector double           vector double                   vector double                  d = si_dfs(a, b)       DFS d, a, b





spu_subx: Vector Subtract Extended 


d = spu_subx(a, b, c) 


Each element of vector b is subtracted from the corresponding element of vector a. An additional 1 is subtracted 
from the result if the least significant bit of the corresponding element of vector c is 0. The final result is returned in 
the corresponding element of vector d. 
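The rule for the extra subtraction is easy to misread, because a least significant bit of 0 in c means that one more is subtracted. The following portable C sketch models a single 32-bit element of spu_subx exactly as described above. The name subx32 is hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit element of d = spu_subx(a, b, c). */
uint32_t subx32(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t extra = (c & 1u) ? 0u : 1u;   /* subtract one more when LSB(c) is 0 */
    return a - b - extra;                  /* wraps modulo 2^32 */
}

/* Usage example with hand-checked values. */
void subx32_example(void)
{
    assert(subx32(10u, 3u, 1u) == 7u);          /* LSB(c) = 1: plain a - b */
    assert(subx32(10u, 3u, 0u) == 6u);          /* LSB(c) = 0: a - b - 1 */
    assert(subx32(0u, 0u, 2u) == 0xFFFFFFFFu);  /* only the LSB of c matters */
}
```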


Table 2-38: Vector Subtract Extended 











Return/Argument Types                                                                     Specific Intrinsics      Assembly Mapping
d                     a                     b                     c
vector signed int     vector signed int     vector signed int     vector signed int      d = si_sfx(b, a, c)      rt <--- c
vector unsigned int   vector unsigned int   vector unsigned int   vector unsigned int                             SFX rt, b, a
                                                                                                                  d <--- rt





2.6. Byte Operation Intrinsics 
spu_absd: Element-Wise Absolute Difference 
d = spu_absd(a, b) 


Each element of vector a is subtracted from the corresponding element of vector b, and the absolute value of the 
result is returned in the corresponding element of vector d. 


Table 2-39: Element-Wise Absolute Difference 





Return/Argument Types                                                 Specific Intrinsics     Assembly Mapping
d                      a                      b
vector unsigned char   vector unsigned char   vector unsigned char    d = si_absdb(a, b)      ABSDB d, a, b











spu_avg: Average of Two Vectors 


d = spu_avg(a, b) 


Each element of vector a is added to the corresponding element of vector b plus 1. The result is shifted to the right 
by 1 bit and placed in the corresponding element of vector d. 
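The added 1 makes the average round up rather than truncate. The following portable C sketch emulates the description above for sixteen unsigned byte elements, using a wider intermediate so that the sum cannot overflow. The name emu_avg is hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates d = spu_avg(a, b) on 16 unsigned bytes: d[i] = (a[i] + b[i] + 1) >> 1. */
void emu_avg(const uint8_t a[16], const uint8_t b[16], uint8_t d[16])
{
    for (int i = 0; i < 16; i++)
        d[i] = (uint8_t)(((unsigned)a[i] + (unsigned)b[i] + 1u) >> 1);
}

/* Usage example with hand-checked values; unlisted elements are zero. */
void emu_avg_example(void)
{
    const uint8_t a[16] = { 255, 1, 0 };
    const uint8_t b[16] = { 255, 2, 1 };
    uint8_t d[16];
    emu_avg(a, b, d);
    assert(d[0] == 255);  /* (255 + 255 + 1) >> 1, no overflow */
    assert(d[1] == 2);    /* (1 + 2 + 1) >> 1, rounds up */
    assert(d[2] == 1);    /* (0 + 1 + 1) >> 1 */
    assert(d[3] == 0);    /* (0 + 0 + 1) >> 1 */
}
```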


Table 2-40: Average of Two Vectors 











Return/Argument Types                                                 Specific Intrinsics    Assembly Mapping
d                      a                      b
vector unsigned char   vector unsigned char   vector unsigned char    d = si_avgb(a, b)      AVGB d, a, b
spu_sumb: Sum Bytes into Shorts


d = spu_sumb(a, b)


Each group of four consecutive elements of b is summed and returned in the corresponding even element of vector d.
Each group of four consecutive elements of a is summed and returned in the corresponding odd element of d.
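The interleaving of the two inputs is the part that is easiest to get wrong: sums from b land in the even halfwords and sums from a in the odd halfwords. The following portable C sketch emulates that layout on plain arrays, assuming that array index 0 corresponds to vector element 0. The name emu_sumb is hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates d = spu_sumb(a, b): four-byte groups of b go to even halfwords of d,
   four-byte groups of a go to odd halfwords of d. */
void emu_sumb(const uint8_t a[16], const uint8_t b[16], uint16_t d[8])
{
    for (int i = 0; i < 4; i++) {
        unsigned sum_a = 0, sum_b = 0;
        for (int j = 0; j < 4; j++) {
            sum_a += a[4 * i + j];
            sum_b += b[4 * i + j];
        }
        d[2 * i]     = (uint16_t)sum_b;   /* even element: sum of b */
        d[2 * i + 1] = (uint16_t)sum_a;   /* odd element: sum of a */
    }
}

/* Usage example: a is all ones, b counts 0..15. */
void emu_sumb_example(void)
{
    uint8_t a[16], b[16];
    uint16_t d[8];
    for (int i = 0; i < 16; i++) { a[i] = 1; b[i] = (uint8_t)i; }
    emu_sumb(a, b, d);
    assert(d[0] == 6 && d[2] == 22 && d[4] == 38 && d[6] == 54);
    assert(d[1] == 4 && d[3] == 4 && d[5] == 4 && d[7] == 4);
}
```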


Table 2-41: Sum Bytes into Shorts 











Return/Argument Types                                                  Specific Intrinsics    Assembly Mapping
d                       a                      b
vector unsigned short   vector unsigned char   vector unsigned char    d = si_sumb(a, b)      SUMB d, a, b





2.7. Compare, Branch and Halt Intrinsics 


spu_bisled: Branch Indirect and Set Link if External Data 


(void) spu_bisled(func)

(void) spu_bisled_d(func)

(void) spu_bisled_e(func)

The count value of channel 0 (event status) is examined. If it is zero, execution continues with the next sequential
instruction. If it is nonzero, the function func is called. The parameter func is the name of, or pointer to, a
parameter-less function with no return value. If func is called, the spu_bisled_d and spu_bisled_e forms of
the intrinsic do one of the following actions:


• Disable interrupts: use spu_bisled_d


• Enable interrupts: use spu_bisled_e


Programming Note: Because the bisled instruction is assumed to behave as a synchronous software interrupt, 
standard calling conventions are not observed because all volatile registers must be considered non-volatile by the 
bisled target function, func. See the SPU Application Binary Interface Specification for additional details about 
standard calling conventions. 


With respect to branch prediction, it is assumed that func is not called. Therefore, a branch hint instruction will not 
be inserted as a result of the spu_bisled() intrinsic. 


Table 2-42: Branch Indirect and Set Link If External Data 











Generic Intrinsic Form   func              Specific Intrinsics   Assembly Mapping
spu_bisled               void (*func)()    si_bisled(func)       BISLED $LR, func
spu_bisled_d                               si_bisledd(func)      BISLEDD $LR, func
spu_bisled_e                               si_bislede(func)      BISLEDE $LR, func





spu_cmpabseq: Element-Wise Compare Absolute Equal 


d = spu_cmpabseq(a, b) 


The absolute value of each element of vector a is compared with the absolute value of the corresponding element of 
vector b. If the absolute values are equal, the corresponding element of vector d is set to all ones; otherwise, the
corresponding element of d is set to all zeros.


Table 2-43: Element-Wise Compare Absolute Equal 





Return/Argument Types                               Specific Intrinsics     Assembly Mapping
d                     a              b
vector unsigned int   vector float   vector float   d = si_fcmeq(a, b)      FCMEQ d, a, b
spu_cmpabsgt: Element-Wise Compare Absolute Greater Than
d = spu_cmpabsgt(a, b)


The absolute value of each element of vector a is compared with the absolute value of the corresponding element of
vector b. If the absolute value of the element of a is greater than the absolute value of the element of b, the
corresponding element of vector d is set to all ones; otherwise, the corresponding element of d is set to all zeros.


Table 2-44: Element-Wise Compare Absolute Greater Than 











Return/Argument Types                               Specific Intrinsics     Assembly Mapping
d                     a              b
vector unsigned int   vector float   vector float   d = si_fcmgt(a, b)      FCMGT d, a, b





spu_cmpeq: Element-Wise Compare Equal 
d = spu_cmpeq(a, b)


Each element of vector a is compared with the corresponding element of vector b. If b is a scalar, the scalar value is
first replicated for each element, and then a and b are compared. If the operands are equal, all bits of the
corresponding element of vector d are set to one. If they are unequal, all bits of the corresponding element of d are
set to zero.


Table 2-45: Element-Wise Compare Equal 





Return/Argument Types                                                             Specific Intrinsics    Assembly Mapping
d                       a                       b
vector unsigned char    vector signed char      vector signed char                d = si_ceqb(a, b)      CEQB d, a, b
                        vector unsigned char    vector unsigned char
vector unsigned short   vector signed short     vector signed short               d = si_ceqh(a, b)      CEQH d, a, b
                        vector unsigned short   vector unsigned short
vector unsigned int     vector signed int       vector signed int                 d = si_ceq(a, b)       CEQ d, a, b
                        vector unsigned int     vector unsigned int
                        vector float            vector float                      d = si_fceq(a, b)      FCEQ d, a, b
vector unsigned char    vector signed char      10-bit signed int (literal)       d = si_ceqbi(a, b)     CEQBI d, a, b
                        vector unsigned char
                        vector signed char      signed char                       See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
                        vector unsigned char    unsigned char
vector unsigned short   vector signed short     10-bit signed int (literal)       d = si_ceqhi(a, b)     CEQHI d, a, b
                        vector unsigned short
                        vector signed short     signed short                      See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
                        vector unsigned short   unsigned short
vector unsigned int     vector signed int       10-bit signed int (literal)       d = si_ceqi(a, b)      CEQI d, a, b
                        vector unsigned int
                        vector signed int       signed int                        See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
                        vector unsigned int     unsigned int
spu_cmpgt: Element-Wise Compare Greater Than
d = spu_cmpgt(a, b)


Each element of vector a is compared with the corresponding element of vector b. If b is a scalar, the scalar value is
replicated for each element and then a and b are compared. If the element of a is greater than the corresponding
element of b, all bits of the corresponding element of vector d are set to one; otherwise, all bits of the corresponding
element of d are set to zero.


Table 2-46: Element-Wise Compare Greater Than 











Return/Argument Types                                                             Specific Intrinsics     Assembly Mapping
d                       a                       b
vector unsigned char    vector signed char      vector signed char                d = si_cgtb(a, b)       CGTB d, a, b
                                                10-bit signed int (literal)       d = si_cgtbi(a, b)      CGTBI d, a, b
                                                signed char                       See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector unsigned char    vector unsigned char    vector unsigned char              d = si_clgtb(a, b)      CLGTB d, a, b
                                                10-bit signed int (literal)       d = si_clgtbi(a, b)     CLGTBI d, a, b
                                                unsigned char                     See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector unsigned short   vector signed short     vector signed short               d = si_cgth(a, b)       CGTH d, a, b
                                                10-bit signed int (literal)       d = si_cgthi(a, b)      CGTHI d, a, b
                                                signed short                      See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector unsigned short   vector unsigned short   vector unsigned short             d = si_clgth(a, b)      CLGTH d, a, b
                                                10-bit signed int (literal)       d = si_clgthi(a, b)     CLGTHI d, a, b
                                                unsigned short                    See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector unsigned int     vector signed int       vector signed int                 d = si_cgt(a, b)        CGT d, a, b
                                                10-bit signed int (literal)       d = si_cgti(a, b)       CGTI d, a, b
                                                signed int                        See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector unsigned int     vector unsigned int     vector unsigned int               d = si_clgt(a, b)       CLGT d, a, b
                                                10-bit signed int (literal)       d = si_clgti(a, b)      CLGTI d, a, b
                                                unsigned int                      See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector unsigned int     vector float            vector float                      d = si_fcgt(a, b)       FCGT d, a, b
spu_hcmpeq: Halt If Compare Equal
(void) spu_hcmpeq (a, b) 


The contents of a and b are compared. If they are equal, execution is halted. 


Table 2-47: Halt If Compare Equal 











Return/Argument Types                         Specific Intrinsics    Assembly Mapping [1, 2]
a              b
int            int (non-literal)              si_heq(a, b)           HEQ rt, a, b
unsigned int   unsigned int (non-literal)
int            10-bit signed int (literal)    si_heqi(a, b)          HEQI rt, a, b
unsigned int





1 Immediate values that cannot be represented as a 10-bit signed value are constructed in a way similar to the method
described in section “2.2.1. Mapping Intrinsics with Scalar Operands” on page 13.


2 The false target parameter rt is optimally chosen depending on the register usage of neighboring instructions.


spu_hcmpgt: Halt If Compare Greater Than 
(void) spu_hcmpgt (a, b) 


The contents of a and b are compared. If a is greater than b, execution is halted.


Table 2-48: Halt If Compare Greater Than 











Return/Argument Types                         Specific Intrinsics    Assembly Mapping [1, 2]
a              b
int            int (non-literal)              si_hgt(a, b)           HGT rt, a, b
unsigned int   unsigned int (non-literal)     si_hlgt(a, b)          HLGT rt, a, b
int            10-bit signed int (literal)    si_hgti(a, b)          HGTI rt, a, b
unsigned int   10-bit signed int (literal)    si_hlgti(a, b)         HLGTI rt, a, b





1 Immediate values that cannot be represented as 10-bit signed values are constructed in a way similar to the method
described in section “2.2.1. Mapping Intrinsics with Scalar Operands” on page 13.


2 The false target parameter rt is optimally chosen depending on the register usage of neighboring instructions.


2.8. Bits and Mask Intrinsics 


spu_cntb: Vector Count Ones for Bytes 
d = spu_cntb (a) 
For each element of vector a, the number of ones is counted, and the count is placed in the corresponding
element of vector d.


Table 2-49: Vector Count Ones for Bytes 











Return/Argument Types                           Specific Intrinsics    Assembly Mapping
d                      a
vector unsigned char   vector unsigned char     d = si_cntb(a)         CNTB d, a
                       vector signed char
spu_cntlz: Vector Count Leading Zeros


d = spu_cntlz (a) 


For each element of vector a, the number of leading zeros is counted, and the resulting count is placed in the 
corresponding element of vector d. 
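As a reference for what is counted, the following portable C sketch computes the leading-zero count of a single 32-bit element. The name clz32 is hypothetical, and returning 32 for an all-zero input is an assumption of this sketch (every one of the 32 bits is then a leading zero); the text above does not call out that case.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit element of d = spu_cntlz(a). */
uint32_t clz32(uint32_t x)
{
    uint32_t n = 0;
    if (x == 0)
        return 32;                       /* assumed result for an all-zero element */
    while ((x & 0x80000000u) == 0) {     /* shift left until the top bit is set */
        x <<= 1;
        n++;
    }
    return n;
}

/* Usage example with hand-checked values. */
void clz32_example(void)
{
    assert(clz32(0x80000000u) == 0);
    assert(clz32(0x00010000u) == 15);
    assert(clz32(1u) == 31);
    assert(clz32(0u) == 32);
}
```

For a vector float argument the count applies to the raw bit pattern of each element, not to its numeric value.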


Table 2-50: Vector Count Leading Zeros 











Return/Argument Types                         Specific Intrinsics    Assembly Mapping
d                     a
vector unsigned int   vector signed int       d = si_clz(a)          CLZ d, a
                      vector unsigned int
                      vector float





spu_gather: Gather Bits From Elements 


d = spu_gather(a)


The rightmost bit (LSB) of each element of vector a is gathered, concatenated, and returned in the rightmost bits of
element 0 of vector d. For a byte vector, 16 bits are gathered; for a halfword vector, 8 bits are gathered; and for a
word vector, 4 bits are gathered. The remaining bits of element 0 of d and all other elements of that vector are
zeroed.
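For the word-vector case, the following portable C sketch shows how four least significant bits collapse into a 4-bit value. It assumes that concatenation runs from element 0 to element 3, so that the bit from element 0 ends up leftmost in the gathered field; that ordering is an assumption of this sketch, and the name gather_words is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Models element 0 of d = spu_gather(a) for a word vector.
   Elements 1 to 3 of d would be zero and are not modelled here. */
uint32_t gather_words(const uint32_t a[4])
{
    uint32_t bits = 0;
    for (int i = 0; i < 4; i++)
        bits = (bits << 1) | (a[i] & 1u);   /* assumed order: element 0 leftmost */
    return bits;
}

/* Usage example: the LSBs are 1, 0, 1, 1, giving binary 1011. */
void gather_words_example(void)
{
    const uint32_t a[4] = { 0xFFFFFFFFu, 2u, 3u, 1u };
    assert(gather_words(a) == 11u);
}
```

A common use is to turn the all-ones or all-zeros result of a compare intrinsic into a compact bit mask that scalar code can test.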


Table 2-51: Gather Bits From Elements 


Return/Argument Types                           Specific Intrinsics    Assembly Mapping
d                     a
vector unsigned int   vector unsigned char      d = si_gbb(a)          GBB d, a
                      vector signed char
                      vector unsigned short     d = si_gbh(a)          GBH d, a
                      vector signed short
                      vector unsigned int       d = si_gb(a)           GB d, a
                      vector signed int
                      vector float





spu_maskb: Form Select Byte Mask 


d = spu_maskb (a) 


For each of the least significant 16 bits of a, each bit is replicated 8 times, producing a 128-bit vector mask that is 
returned in vector d. 
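The following portable C sketch expands a 16-bit value into a 16-byte mask as described above. It assumes that the leftmost of the 16 bits controls byte 0 of the result; that ordering is an assumption of this sketch, and the name emu_maskb is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates d = spu_maskb(a): each of the low 16 bits of a becomes one byte,
   0xFF for a set bit and 0x00 for a clear bit. */
void emu_maskb(uint16_t a, uint8_t d[16])
{
    for (int i = 0; i < 16; i++)
        d[i] = ((a >> (15 - i)) & 1u) ? 0xFF : 0x00;   /* assumed: leftmost bit -> byte 0 */
}

/* Usage example: 0x8001 selects the first and last bytes only. */
void emu_maskb_example(void)
{
    uint8_t d[16];
    emu_maskb(0x8001u, d);
    assert(d[0] == 0xFF && d[15] == 0xFF);
    for (int i = 1; i < 15; i++)
        assert(d[i] == 0x00);
}
```

The resulting mask is in the form expected by the pattern argument of spu_sel.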


Table 2-52: Form Select Byte Mask 





Return/Argument Types                                   Specific Intrinsics    Assembly Mapping
d                      a
vector unsigned char   unsigned short                   d = si_fsmb(a)         FSMB d, a
                       signed short
                       unsigned int
                       signed int
                       16-bit unsigned int (literal)    d = si_fsmbi(a)        FSMBI d, a
spu_maskh: Form Select Halfword Mask


d = spu_maskh(a)


For each of the least significant 8 bits of a, each bit is replicated 16 times, producing a 128-bit vector mask that is 
returned in vector d. 


Table 2-53: Form Select Halfword Mask 





Return/Argument Types                      Specific Intrinsics    Assembly Mapping
d                       a
vector unsigned short   unsigned char      d = si_fsmh(a)         FSMH d, a
                        signed char
                        unsigned short
                        signed short
                        unsigned int
                        signed int





spu_maskw: Form Select Word Mask


d = spu_maskw(a)


For each of the least significant 4 bits of a, each bit is replicated 32 times, producing a 128-bit vector mask that is 
returned in vector d. 


Table 2-54: Form Select Word Mask 





Return/Argument Types                    Specific Intrinsics    Assembly Mapping
d                     a
vector unsigned int   unsigned char      d = si_fsm(a)          FSM d, a
                      signed char
                      unsigned short
                      signed short
                      unsigned int
                      signed int
spu_sel: Select Bits
d = spu_sel(a, b, pattern) 


For each bit in the 128-bit vector pattern, the corresponding bit from either vector a or vector b is selected. If the
bit is 0, the bit from a is selected; otherwise, the bit from b is selected. The result is returned in vector d.
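Bit selection pairs naturally with the compare intrinsics, whose all-ones or all-zeros results can serve as the pattern. The following portable C sketch models one 32-bit element of spu_sel and uses a greater-than mask to pick the larger of two values without a branch. The names sel32, cmpgt_mask, and max_via_sel are hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of spu_sel on 32 bits: take a where pattern is 0, b where it is 1. */
uint32_t sel32(uint32_t a, uint32_t b, uint32_t pattern)
{
    return (a & ~pattern) | (b & pattern);
}

/* Scalar model of a signed compare-greater-than result: all ones if a > b, else all zeros. */
uint32_t cmpgt_mask(int32_t a, int32_t b)
{
    return (a > b) ? 0xFFFFFFFFu : 0u;
}

/* Branch-free maximum: select a wherever the mask says a > b, otherwise b. */
int32_t max_via_sel(int32_t a, int32_t b)
{
    uint32_t mask = cmpgt_mask(a, b);
    return (int32_t)sel32((uint32_t)b, (uint32_t)a, mask);
}

/* Usage example with hand-checked values. */
void sel32_example(void)
{
    assert(sel32(0x12345678u, 0xABCDEF01u, 0x0000FFFFu) == 0x1234EF01u);
    assert(max_via_sel(3, 7) == 7);
    assert(max_via_sel(-1, -5) == -1);
}
```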


Table 2-55: Select Bits 













Return/Argument Types                                                                                                 Specific Intrinsics            Assembly Mapping
d                           a                           b                           pattern
vector unsigned char        vector unsigned char        vector unsigned char        vector unsigned char            d = si_selb(a, b, pattern)     SELB d, a, b, pattern
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector float                vector float                vector float
vector unsigned long long   vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector double               vector double               vector double





spu_shuffle: Shuffle Two Vectors of Bytes 
d = spu_shuffle(a, b, pattern) 


For each byte of pattern, the byte is examined, and a byte is produced, as shown in Figure 2-2. The result is 
returned in the corresponding byte of vector d. 


Figure 2-2: Shuffle Pattern 











Value in the Byte of Pattern (in binary)    Resulting Byte

10xxxxxx                                    0x00

110xxxxx                                    0xFF

111xxxxx                                    0x80

otherwise                                   the byte of (a || b) addressed by the rightmost 5 bits of pattern
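The pattern rules in Figure 2-2 can be sketched in portable C as follows. The sketch treats a || b as a 32-byte array in which bytes 0 to 15 come from a and bytes 16 to 31 from b. The name emu_shuffle is hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates d = spu_shuffle(a, b, pattern) one byte at a time. */
void emu_shuffle(const uint8_t a[16], const uint8_t b[16],
                 const uint8_t pattern[16], uint8_t d[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t p = pattern[i];
        if ((p & 0xC0) == 0x80) {
            d[i] = 0x00;                          /* 10xxxxxx */
        } else if ((p & 0xE0) == 0xC0) {
            d[i] = 0xFF;                          /* 110xxxxx */
        } else if ((p & 0xE0) == 0xE0) {
            d[i] = 0x80;                          /* 111xxxxx */
        } else {
            unsigned idx = p & 0x1Fu;             /* rightmost 5 bits index a || b */
            d[i] = (idx < 16) ? a[idx] : b[idx - 16];
        }
    }
}

/* Usage example: a holds 0x00..0x0F and b holds 0x10..0x1F. */
void emu_shuffle_example(void)
{
    uint8_t a[16], b[16], d[16];
    const uint8_t pattern[16] = { 0x0F, 0x10, 0x80, 0xC0, 0xE0, 0x1F, 0x00, 0x03 };
    for (int i = 0; i < 16; i++) { a[i] = (uint8_t)i; b[i] = (uint8_t)(0x10 + i); }
    emu_shuffle(a, b, pattern, d);
    assert(d[0] == 0x0F && d[1] == 0x10);             /* a[15], b[0] */
    assert(d[2] == 0x00 && d[3] == 0xFF && d[4] == 0x80);
    assert(d[5] == 0x1F && d[6] == 0x00 && d[7] == 0x03);
}
```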
Table 2-56: Shuffle Two Vectors of Bytes











Return/Argument Types                                                                                            Specific Intrinsics             Assembly Mapping
d                           a                           b                           pattern
vector unsigned char        vector unsigned char        vector unsigned char        vector unsigned char       d = si_shufb(a, b, pattern)     SHUFB d, a, b, pattern
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector float                vector float                vector float
vector double               vector double               vector double





2.9. Logical Intrinsics 


spu_and: Vector Bit-Wise AND 
d = spu_and(a, b) 


Each bit of vector a is logically ANDed with the corresponding bit of vector b. If b is a scalar, the scalar value is 
first replicated for each element, and then a and b are ANDed. The results are returned in the corresponding bit of 
vector d. 


Table 2-57: Vector Bit-Wise AND 











Return/Argument Types                                                                   Specific Intrinsics     Assembly Mapping
d                           a                           b
vector unsigned char        vector unsigned char        vector unsigned char           d = si_and(a, b)        AND d, a, b
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector float                vector float                vector float
vector double               vector double               vector double
vector unsigned char        vector unsigned char        10-bit signed int (literal)    d = si_andbi(a, b)      ANDBI d, a, b
vector signed char          vector signed char
vector unsigned char        vector unsigned char        unsigned char                  See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed char          vector signed char          signed char
vector unsigned short       vector unsigned short       10-bit signed int (literal)    d = si_andhi(a, b)      ANDHI d, a, b
vector signed short         vector signed short
vector unsigned short       vector unsigned short       unsigned short                 See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed short         vector signed short         signed short
vector unsigned int         vector unsigned int         10-bit signed int (literal)    d = si_andi(a, b)       ANDI d, a, b
vector signed int           vector signed int
vector unsigned int         vector unsigned int         unsigned int                   See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed int           vector signed int           signed int





spu_andc: Vector Bit-Wise AND with Complement 
d = spu_andc(a, b)


Each bit of vector a is ANDed with the complement of the corresponding bit of vector b. The result is returned in the 
corresponding bit of vector d. 


Table 2-58: Vector Bit-Wise AND with Complement 











Return/Argument Types                                                                   Specific Intrinsics     Assembly Mapping
d                           a                           b
vector unsigned char        vector unsigned char        vector unsigned char           d = si_andc(a, b)       ANDC d, a, b
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector float                vector float                vector float
vector double               vector double               vector double





spu_eqv: Vector Bit-Wise Equivalent


d = spu_eqv(a, b)


Each bit of vector a is compared with the corresponding bit of vector b. The corresponding bit of vector d is set to 1
if the bits in a and b are equivalent; otherwise, the bit is set to 0.
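Bit-wise equivalence is the complement of exclusive OR. The following portable C sketch models 32 bits of spu_eqv on that basis; the name eqv32 is hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of 32 bits of d = spu_eqv(a, b): a result bit is 1 where a and b agree. */
uint32_t eqv32(uint32_t a, uint32_t b)
{
    return (uint32_t)~(a ^ b);
}

/* Usage example with hand-checked values. */
void eqv32_example(void)
{
    assert(eqv32(0xF0F0F0F0u, 0xFF00FF00u) == 0xF00FF00Fu);
    assert(eqv32(0x12345678u, 0x12345678u) == 0xFFFFFFFFu);  /* identical inputs */
    assert(eqv32(0x00000000u, 0xFFFFFFFFu) == 0x00000000u);  /* every bit differs */
}
```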


Table 2-59: Vector Bit-Wise Equivalent 





Return/Argument Types                                                                   Specific Intrinsics     Assembly Mapping
d                           a                           b
vector unsigned char        vector unsigned char        vector unsigned char           d = si_eqv(a, b)        EQV d, a, b
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector float                vector float                vector float
vector double               vector double               vector double
spu_nand: Vector Bit-Wise Complement of AND


d = spu_nand(a, b) 


Each bit of vector a is ANDed with the corresponding bit of vector b. The complement of the result is returned in the
corresponding bit of vector d.


Table 2-60: Vector Bit-Wise Complement of AND 





Return/Argument Types                                                                   Specific Intrinsics     Assembly Mapping
d                           a                           b
vector unsigned char        vector unsigned char        vector unsigned char           d = si_nand(a, b)       NAND d, a, b
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector float                vector float                vector float
vector double               vector double               vector double





spu_nor: Vector Bit-Wise Complement of OR 


d = spu_nor(a, b)


Each bit of vector a is ORed with the corresponding bit of vector b. The complement of the result is returned in the
corresponding bit of vector d.
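
The two complement operations (spu_nand and spu_nor) can be sketched in portable C on a single 32-bit word. The helper names nand32 and nor32 are hypothetical; they model the bit rule only and are not part of the SPU intrinsic set.

```c
#include <stdint.h>

/* Hypothetical scalar models, one 32-bit word at a time. */

/* Complement of AND: a result bit is 0 only where both inputs are 1. */
static uint32_t nand32(uint32_t a, uint32_t b)
{
    return ~(a & b);
}

/* Complement of OR: a result bit is 1 only where both inputs are 0. */
static uint32_t nor32(uint32_t a, uint32_t b)
{
    return ~(a | b);
}
```

With a = 0xF0F0F0F0 and b = 0xFF00FF00, this model gives 0x0FFF0FFF for nand32 and 0x000F000F for nor32.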


Table 2-61: Vector Bit-Wise Complement of OR 





Return/Argument Types                                                               Specific Intrinsics   Assembly Mapping
d                           a                           b
vector unsigned char        vector unsigned char        vector unsigned char        d = si_nor(a, b)      NOR d, a, b
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector float                vector float                vector float
vector double               vector double               vector double



spu_or: Vector Bit-Wise OR


d = spu_or(a, b)


Each bit of vector a is logically ORed with the corresponding bit of vector b. If b is a scalar, the scalar value is first
replicated for each element, and then a and b are ORed. The result is returned in the corresponding bit of vector d.


Table 2-62: Vector Bit-Wise OR 











Return/Argument Types                                                               Specific Intrinsics   Assembly Mapping
d                           a                           b
vector unsigned char        vector unsigned char        vector unsigned char        d = si_or(a, b)       OR d, a, b
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector float                vector float                vector float
vector double               vector double               vector double

vector unsigned char        vector unsigned char        10-bit signed int (literal) d = si_orbi(a, b)     ORBI d, a, b
vector signed char          vector signed char

vector unsigned char        vector unsigned char        unsigned char               See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed char          vector signed char          signed char

vector unsigned short       vector unsigned short       10-bit signed int (literal) d = si_orhi(a, b)     ORHI d, a, b
vector signed short         vector signed short

vector unsigned short       vector unsigned short       unsigned short              See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed short         vector signed short         signed short

vector unsigned int         vector unsigned int         10-bit signed int (literal) d = si_ori(a, b)      ORI d, a, b
vector signed int           vector signed int

vector unsigned int         vector unsigned int         unsigned int                See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed int           vector signed int           signed int



spu_orc: Vector Bit-Wise OR with Complement


d = spu_orc(a, b)


Each bit of vector a is ORed with the complement of the corresponding bit of vector b. The result is returned in the
corresponding bit of vector d.
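
A minimal portable C sketch of this rule on one 32-bit word follows. The helper name orc32 is hypothetical, and the function models only the bit-level behavior described above.

```c
#include <stdint.h>

/* Hypothetical scalar model of OR with complement on one 32-bit word:
   each bit of a is ORed with the inverted bit of b. */
static uint32_t orc32(uint32_t a, uint32_t b)
{
    return a | ~b;
}
```

Under this model, orc32(x, x) is all ones for any x, because every bit is ORed with its own complement.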


Table 2-63: Vector Bit-Wise OR with Complement 





Return/Argument Types                                                               Specific Intrinsics   Assembly Mapping
d                           a                           b
vector unsigned char        vector unsigned char        vector unsigned char        d = si_orc(a, b)      ORC d, a, b
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector float                vector float                vector float
vector double               vector double               vector double





spu_orx: OR Word Across 


d = spu_orx(a)


The four word elements of vector a are logically ORed. The result is returned in word element 0 of vector d. All other
elements (1, 2, 3) of d are assigned a value of zero.
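
The OR-across behavior can be sketched in portable C by treating the quadword as an array of four 32-bit words, with index 0 as word element 0. The function name orx_model and the array representation are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical model of OR Word Across on a quadword held as four words.
   d[0] receives a[0] | a[1] | a[2] | a[3]; d[1], d[2], and d[3] are cleared. */
static void orx_model(const uint32_t a[4], uint32_t d[4])
{
    uint32_t acc = a[0] | a[1] | a[2] | a[3];  /* computed first, so d may alias a */
    d[0] = acc;
    d[1] = 0;
    d[2] = 0;
    d[3] = 0;
}
```

A typical use of such an operation is testing whether any word of a vector is nonzero by inspecting only element 0 of the result.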


Table 2-64: OR Word Across 





Return/Argument Types                                   Specific Intrinsics   Assembly Mapping
d                           a
vector unsigned int         vector unsigned int         d = si_orx(a)         ORX d, a
vector signed int           vector signed int





spu_xor: Vector Bit-Wise Exclusive OR 


d = spu_xor(a, b)


Each element of vector a is exclusive-ORed with the corresponding element of vector b. If b is a scalar, the scalar 
value is first replicated for each element. The result is returned in the corresponding bit of vector d. 


Table 2-65: Vector Bit-Wise Exclusive OR 











Return/Argument Types                                                               Specific Intrinsics   Assembly Mapping
d                           a                           b
vector unsigned char        vector unsigned char        vector unsigned char        d = si_xor(a, b)      XOR d, a, b
vector signed char          vector signed char          vector signed char
vector unsigned short       vector unsigned short       vector unsigned short
vector signed short         vector signed short         vector signed short
vector unsigned int         vector unsigned int         vector unsigned int
vector signed int           vector signed int           vector signed int
vector unsigned long long   vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long     vector signed long long
vector float                vector float                vector float
vector double               vector double               vector double

vector unsigned char        vector unsigned char        10-bit signed int (literal) d = si_xorbi(a, b)    XORBI d, a, b
vector signed char          vector signed char

vector unsigned char        vector unsigned char        unsigned char               See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed char          vector signed char          signed char

vector unsigned short       vector unsigned short       10-bit signed int (literal) d = si_xorhi(a, b)    XORHI d, a, b
vector signed short         vector signed short

vector unsigned short       vector unsigned short       unsigned short              See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed short         vector signed short         signed short

vector unsigned int         vector unsigned int         10-bit signed int (literal) d = si_xori(a, b)     XORI d, a, b
vector signed int           vector signed int

vector unsigned int         vector unsigned int         unsigned int                See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed int           vector signed int           signed int





2.10. Shift and Rotate Intrinsics 


spu_rl: Element-Wise Rotate Left by Bits 


d = spu_rl(a, count) 


Each element of vector a is rotated left by the number of bits specified by the corresponding element in vector 
count. Bits rotated out of the left end of the element are rotated in at the right end. A limited number of count bits 
are used depending on the size of the element. For halfword elements, the 4 least significant bits of count are used. 
For word elements, the 5 least significant bits of count are used. 


The results are returned in the corresponding elements of vector d. 
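
The count masking described above can be sketched in portable C for a single halfword and a single word. The helper names rotl16 and rotl32 are hypothetical; they model one element at a time rather than a full vector.

```c
#include <stdint.h>

/* Hypothetical model: rotate one halfword left, using the 4 least
   significant bits of count. */
static uint16_t rotl16(uint16_t h, int count)
{
    unsigned n = (unsigned)count & 0xFu;
    uint32_t x = h;                        /* widen so both shifts are defined */
    return (uint16_t)((x << n) | (x >> (16u - n)));
}

/* Hypothetical model: rotate one word left, using the 5 least
   significant bits of count. */
static uint32_t rotl32(uint32_t w, int count)
{
    unsigned n = (unsigned)count & 0x1Fu;
    return n ? ((w << n) | (w >> (32u - n))) : w;   /* avoid a shift by 32 */
}
```

Because only the low bits of count are used, a count of 20 rotates a halfword by 4 bits, and a count of 36 rotates a word by 4 bits.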


Table 2-66: Element-Wise Rotate Left by Bits 





Return/Argument Types                                                               Specific Intrinsics       Assembly Mapping
d                           a                           count
vector unsigned short       vector unsigned short       vector signed short         d = si_roth(a, count)     ROTH d, a, count
vector signed short         vector signed short

vector unsigned int         vector unsigned int         vector signed int           d = si_rot(a, count)      ROT d, a, count
vector signed int           vector signed int

vector unsigned short       vector unsigned short       7-bit signed int (literal)  d = si_rothi(a, count)    ROTHI d, a, count
vector signed short         vector signed short

vector unsigned short       vector unsigned short       int                         See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed short         vector signed short

vector unsigned int         vector unsigned int         7-bit signed int (literal)  d = si_roti(a, count)     ROTI d, a, count
vector signed int           vector signed int

vector unsigned int         vector unsigned int         int                         See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed int           vector signed int



spu_rlmask: Element-Wise Rotate Left and Mask by Bits


d = spu_rlmask(a, count)


This function uses an element-wise rotate left and mask operation to perform a logical shift right (LSR) by bits of
each element of vector a, where count represents the negated value, or values, of the desired corresponding
right-shift amounts. (The count parameter can be either a vector or a scalar, as shown in Table 2-67.) For example, if
scalar count is -5, each element of a is shifted right by 5 bits. The effect of this function is more precisely shown by
the following code:


For (each halfword element h in vector a) {
    int bitshift = -count & 0x1F;
    h = (bitshift & 0x10) ? 0 : LSR(h, bitshift);
}

For (each word element w in vector a) {
    int bitshift = -count & 0x3F;
    w = (bitshift & 0x20) ? 0 : LSR(w, bitshift);
}


The results are returned in the corresponding elements of vector d. 
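
The pseudo-code above can be transcribed into portable C for a single element as a sketch. The helper names rlmask16 and rlmask32 are hypothetical; the negation is done in unsigned arithmetic so that it is well defined for every int value.

```c
#include <stdint.h>

/* Hypothetical model of the halfword case: logical shift right by -count. */
static uint16_t rlmask16(uint16_t h, int count)
{
    unsigned bitshift = (0u - (unsigned)count) & 0x1Fu;
    return (bitshift & 0x10u) ? 0 : (uint16_t)(h >> bitshift);
}

/* Hypothetical model of the word case: logical shift right by -count. */
static uint32_t rlmask32(uint32_t w, int count)
{
    unsigned bitshift = (0u - (unsigned)count) & 0x3Fu;
    return (bitshift & 0x20u) ? 0 : (w >> bitshift);
}
```

In this model a count of -5 shifts right by 5 bits, a count of 0 leaves the element unchanged, and a count whose negated value is at least the element width produces zero.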


Table 2-67: Element-Wise Rotate Left and Mask by Bits 





Return/Argument Types                                                               Specific Intrinsics       Assembly Mapping
d                           a                           count
vector unsigned short       vector unsigned short       vector signed short         d = si_rothm(a, count)    ROTHM d, a, count
vector signed short         vector signed short

vector unsigned int         vector unsigned int         vector signed int           d = si_rotm(a, count)     ROTM d, a, count
vector signed int           vector signed int

vector unsigned short       vector unsigned short       7-bit signed int (literal)  d = si_rothmi(a, count)   ROTHMI d, a, count
vector signed short         vector signed short

vector unsigned short       vector unsigned short       int                         See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed short         vector signed short

vector unsigned int         vector unsigned int         7-bit signed int (literal)  d = si_rotmi(a, count)    ROTMI d, a, count
vector signed int           vector signed int

vector unsigned int         vector unsigned int         int                         See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed int           vector signed int





spu_rlmaska: Element-Wise Rotate Left and Mask Algebraic by Bits 


d = spu_rlmaska(a, count)


This function uses an element-wise rotate left and mask operation to perform an arithmetic shift right (ASR) of
each element of vector a, where count represents the negated value, or values, of the desired corresponding
right-shift amounts. (The count parameter can be either a vector or a scalar, as shown in Table 2-68.) For example, if
scalar count is -5, each element of a is shifted right by 5 bits. The effect of this function is more precisely shown by
the following code:


For (each halfword element h in vector a) {
    int bitshift = -count & 0x1F;
    h = (bitshift & 0x10) ? 0 : ASR(h, bitshift);
}

For (each word element w in vector a) {
    int bitshift = -count & 0x3F;
    w = (bitshift & 0x20) ? 0 : ASR(w, bitshift);
}


The results are returned in the corresponding elements of vector d. 
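
To make the ASR helper used in the pseudo-code concrete, the following portable C sketch performs an arithmetic shift right on the bit pattern of one element. It takes the right-shift amount n directly (n corresponds to -count) and covers only amounts smaller than the element width; larger amounts are not modeled here. The names asr16_bits and asr32_bits are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model: arithmetic shift right of a 16-bit pattern.
   Requires n < 16. Vacated high bits are filled with the sign bit. */
static uint16_t asr16_bits(uint16_t h, unsigned n)
{
    uint16_t r = (uint16_t)(h >> n);
    if ((h & 0x8000u) && n != 0)
        r |= (uint16_t)((uint32_t)0xFFFFu << (16u - n));
    return r;
}

/* Hypothetical model: arithmetic shift right of a 32-bit pattern.
   Requires n < 32. */
static uint32_t asr32_bits(uint32_t w, unsigned n)
{
    uint32_t r = w >> n;
    if ((w & 0x80000000u) && n != 0)
        r |= (uint32_t)0xFFFFFFFFu << (32u - n);
    return r;
}
```

The difference from a logical shift shows up only for negative values: the pattern 0x8000 shifted right by 5 becomes 0xFC00 here, whereas a logical shift would give 0x0400.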


Table 2-68: Element-Wise Rotate Left and Mask Algebraic by Bits


Return/Argument Types                                                               Specific Intrinsics       Assembly Mapping
d                           a                           count
vector unsigned short       vector unsigned short       vector signed short         d = si_rotmah(a, count)   ROTMAH d, a, count
vector signed short         vector signed short

vector unsigned int         vector unsigned int         vector signed int           d = si_rotma(a, count)    ROTMA d, a, count
vector signed int           vector signed int

vector unsigned short       vector unsigned short       7-bit signed int (literal)  d = si_rotmahi(a, count)  ROTMAHI d, a, count
vector signed short         vector signed short

vector unsigned short       vector unsigned short       int                         See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed short         vector signed short

vector unsigned int         vector unsigned int         7-bit signed int (literal)  d = si_rotmai(a, count)   ROTMAI d, a, count
vector signed int           vector signed int

vector unsigned int         vector unsigned int         int                         See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed int           vector signed int





spu_rlmaskqw: Rotate Left and Mask Quadword by Bits 


d = spu_rlmaskqw(a, count) 


This function uses a rotate and mask quadword by bits operation to perform a quadword logical shift right (LSR) of
up to 7 bits, where count represents the negated value of the desired right-shift amount. For example, if count is
-5, vector a is shifted right by 5 bits. The effect of this function is more precisely shown by the following code:


qword spu_rlmaskqw(qword a, int count)
{
    int bitshift = -count & 0x7;
    return LSR(a, bitshift);
}


The resulting quadword is returned in vector d.


Table 2-69: Rotate Left and Mask Quadword by Bits











Return/Argument Types                                                               Specific Intrinsics           Assembly Mapping
d                           a                           count
vector unsigned char        vector unsigned char        int (literal)               d = si_rotqmbii(a, count)     ROTQMBII d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double

vector unsigned char        vector unsigned char        int (non-literal)           d = si_rotqmbi(a, count)      ROTQMBI d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double





spu_rlmaskqwbyte: Rotate Left and Mask Quadword by Bytes 
d = spu_rlmaskqwbyte(a, count) 
This function uses a rotate and mask quadword by bytes operation to perform a quadword logical shift right (LSR)
by bytes, where count represents the negated value of the desired byte right-shift amount. For example, if count
is -5, vector a is shifted right by 5 bytes. The effect of this function is more precisely shown by the following code:


qword spu_rlmaskqwbyte(qword a, int count)
{
    int bitshift = (-count << 3) & 0xF8;
    return LSR(a, bitshift);
}


The resulting quadword is returned in vector d. 


Table 2-70: Rotate Left and Mask Quadword by Bytes 











Return/Argument Types                                                               Specific Intrinsics           Assembly Mapping
d                           a                           count
vector unsigned char        vector unsigned char        int (literal)               d = si_rotqmbyi(a, count)     ROTQMBYI d, a, count
vector signed char          vector signed char                                      (count = 7-bit immediate)
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double

vector unsigned char        vector unsigned char        int (non-literal)           d = si_rotqmby(a, count)      ROTQMBY d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double



spu_rlmaskqwbytebc: Rotate Left and Mask Quadword by Bytes From Bit Shift Count
d = spu_rlmaskqwbytebc(a, count)
This function uses a rotate and mask quadword by bytes from bit shift count operation to perform a quadword logical
shift right (LSR) by bytes, where bits 24-28 of count represent the negated value of the desired byte right-shift
amount. For example, if the bit shift count is -10, vector a is shifted right by 2 bytes. The effect of this function is
more precisely shown by the following code:


qword spu_rlmaskqwbytebc(qword a, int count)
{
    int bitshift = -(count & 0xF8) & 0xF8;
    return LSR(a, bitshift);
}


The resulting quadword is returned in vector d. 


Programming Note: The following example code shows typical usage of this function; it computes a vector d that is 
the value of vector a logically shifted right by n bits: 


d = spu_rlmaskqwbytebc(a,7-n); 
d = spu_rlmaskqw(d,-n); 
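

To see why this two-step idiom shifts by exactly n bits, the shift amounts from the pseudo-code for spu_rlmaskqwbytebc and spu_rlmaskqw can be computed in portable C. The sketch below models only the shift-amount arithmetic, not the quadword data; all three function names are hypothetical.

```c
#include <stdint.h>

/* Bit-shift amount applied by spu_rlmaskqwbytebc(a, count), following the
   pseudo-code: a whole number of bytes, expressed in bits. */
static unsigned bytebc_shift_bits(int count)
{
    unsigned c = (unsigned)count;
    return (0u - (c & 0xF8u)) & 0xF8u;
}

/* Bit-shift amount applied by spu_rlmaskqw(a, count): 0 to 7 bits. */
static unsigned qw_shift_bits(int count)
{
    return (0u - (unsigned)count) & 0x7u;
}

/* Total right shift performed by the two-step idiom for a given n. */
static unsigned total_right_shift(int n)
{
    return bytebc_shift_bits(7 - n) + qw_shift_bits(-n);
}
```

In this model the first call contributes the whole-byte part of n (in bits) and the second call contributes the remaining n mod 8 bits, so the total equals n for every n from 0 to 127.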


Table 2-71: Rotate Left and Mask Quadword by Bytes From Bit Shift Count 











Return/Argument Types                                                               Specific Intrinsics           Assembly Mapping
d                           a                           count
vector unsigned char        vector unsigned char        int                         d = si_rotqmbybi(a, count)    ROTQMBYBI d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double





spu_rlqw: Rotate Left Quadword by Bits 
d = spu_rlqw(a, count) 


Vector a is rotated to the left by the number of bits specified by the 3 least significant bits of count. Bits rotated out 
of the left end of the vector are rotated in on the right. The result is returned in vector d. 


Table 2-72: Rotate Left Quadword by Bits 











Return/Argument Types                                                               Specific Intrinsics           Assembly Mapping
d                           a                           count
vector unsigned char        vector unsigned char        int (literal)               d = si_rotqbii(a, count)      ROTQBII d, a, count
vector signed char          vector signed char                                      (count = 7-bit immediate)
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double

vector unsigned char        vector unsigned char        int (non-literal)           d = si_rotqbi(a, count)       ROTQBI d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double





spu_rlqwbyte: Rotate Left Quadword by Bytes 


d = spu_rlqwbyte(a, count)


Vector a is rotated to the left by the number of bytes specified by the 4 least significant bits of count. Bytes rotated 
out of the left end of the vector are rotated in on the right. The result is returned in vector d. 
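
A byte-level sketch of this rotation in portable C follows, with the quadword modeled as a 16-byte array in which index 0 is the leftmost byte. The function name rlqwbyte_model and the array layout are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model: rotate a 16-byte quadword left by (count & 0xF) bytes.
   Byte 0 is the leftmost byte; bytes leaving the left end re-enter on the right. */
static void rlqwbyte_model(const uint8_t a[16], unsigned count, uint8_t d[16])
{
    unsigned n = count & 0xFu;
    uint8_t tmp[16];
    for (unsigned i = 0; i < 16; ++i)
        tmp[i] = a[(i + n) & 0xFu];
    memcpy(d, tmp, sizeof tmp);          /* temporary copy allows d to alias a */
}
```

Since only the 4 least significant bits of count are used, a count of 16 leaves the quadword unchanged and a count of 19 behaves like a count of 3.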


Table 2-73: Rotate Left Quadword by Bytes 





Return/Argument Types                                                               Specific Intrinsics           Assembly Mapping
d                           a                           count
vector unsigned char        vector unsigned char        int (literal)               d = si_rotqbyi(a, count)      ROTQBYI d, a, count
vector signed char          vector signed char                                      (count = 7-bit immediate)
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double

vector unsigned char        vector unsigned char        int (non-literal)           d = si_rotqby(a, count)       ROTQBY d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double



spu_rlqwbytebc: Rotate Left Quadword by Bytes From Bit Shift Count


d = spu_rlqwbytebc(a, count) 


Vector a is rotated to the left by the number of bytes specified by bits 24-28 of count. Bytes rotated out of the left 
end of the vector are rotated in at the right. The result is returned in vector d. 


Table 2-74: Rotate Left Quadword by Bytes From Bit Shift Count 





Return/Argument Types                                                               Specific Intrinsics           Assembly Mapping
d                           a                           count
vector unsigned char        vector unsigned char        int                         d = si_rotqbybi(a, count)     ROTQBYBI d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double





spu_sl: Element-Wise Shift Left by Bits 


d = spu_sl(a, count) 


Each element of vector a is shifted left by the number of bits specified by the corresponding element in vector 
count. If count is a scalar, the scalar value is first replicated for each element, and then a is shifted. 


Bits shifted out of the left end of the element are discarded, and zeros are shifted in at the right. A limited number of
count bits are used depending on the size of the element. For halfword elements, the 5 least significant bits of
count are used, and for word elements, the 6 least significant bits are used. The result is returned in the
corresponding element of vector d.
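
The effect of the count masking can be sketched in portable C for a single halfword and a single word. The helper names sl16 and sl32 are hypothetical. Because one more count bit is used than is needed to address every bit position, counts equal to or greater than the element width shift all bits out and leave zero.

```c
#include <stdint.h>

/* Hypothetical model: shift one halfword left using the 5 least
   significant bits of count. Amounts 16..31 leave zero. */
static uint16_t sl16(uint16_t h, unsigned count)
{
    unsigned n = count & 0x1Fu;
    return (uint16_t)((uint32_t)h << n);   /* truncation discards shifted-out bits */
}

/* Hypothetical model: shift one word left using the 6 least
   significant bits of count. Amounts 32..63 leave zero. */
static uint32_t sl32(uint32_t w, unsigned count)
{
    unsigned n = count & 0x3Fu;
    return (n < 32u) ? (w << n) : 0;       /* avoid an undefined shift by >= 32 */
}
```

For example, sl16(0x00FF, 4) gives 0x0FF0 and sl16(0xFFFF, 16) gives 0 in this model.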


Table 2-75: Element-Wise Shift Left by Bits 





Return/Argument Types                                                               Specific Intrinsics       Assembly Mapping
d                           a                           count
vector unsigned short       vector unsigned short       vector unsigned short       d = si_shlh(a, count)     SHLH d, a, count
vector signed short         vector signed short

vector unsigned int         vector unsigned int         vector unsigned int         d = si_shl(a, count)      SHL d, a, count
vector signed int           vector signed int

vector unsigned short       vector unsigned short       7-bit unsigned int          d = si_shlhi(a, count)    SHLHI d, a, count
vector signed short         vector signed short         (literal)

vector unsigned short       vector unsigned short       unsigned int                See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed short         vector signed short

vector unsigned int         vector unsigned int         7-bit unsigned int          d = si_shli(a, count)     SHLI d, a, count
vector signed int           vector signed int           (literal)

vector unsigned int         vector unsigned int         unsigned int                See section “2.2.1. Mapping Intrinsics with Scalar Operands”.
vector signed int           vector signed int



spu_slqw: Shift Left Quadword by Bits


d = spu_slqw(a, count)


Vector a is shifted left by the number of bits specified by the 3 least significant bits of count. Bits shifted out of the 
left end of the vector are discarded, and zeros are shifted in at the right. The result is returned in vector d. 


Table 2-76: Shift Left Quadword by Bits 





Return/Argument Types                                                               Specific Intrinsics           Assembly Mapping
d                           a                           count
vector unsigned char        vector unsigned char        unsigned int (literal)      d = si_shlqbii(a, count)      SHLQBII d, a, count
vector signed char          vector signed char                                      (count = 7-bit immediate)
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double

vector unsigned char        vector unsigned char        unsigned int (non-literal)  d = si_shlqbi(a, count)       SHLQBI d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double





spu_slqwbyte: Shift Left Quadword by Bytes 


d = spu_slqwbyte(a, count)


Vector a is shifted left by the number of bytes specified by the 5 least significant bits of count. Bytes shifted out of 
the left end of the vector are discarded, and zeros are shifted in at the right. The result is returned in vector d. 
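
A byte-level sketch of this shift in portable C follows, with the quadword modeled as a 16-byte array in which index 0 is the leftmost byte. The function name slqwbyte_model and the array layout are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model: shift a 16-byte quadword left by (count & 0x1F) bytes.
   Byte 0 is the leftmost byte; zeros enter on the right. Amounts of 16 or
   more shift every byte out and leave an all-zero result. */
static void slqwbyte_model(const uint8_t a[16], unsigned count, uint8_t d[16])
{
    unsigned n = count & 0x1Fu;
    uint8_t tmp[16] = {0};
    if (n < 16u)
        memcpy(tmp, a + n, 16u - n);
    memcpy(d, tmp, sizeof tmp);          /* temporary copy allows d to alias a */
}
```

The contrast with the rotate form is that bytes leaving the left end are lost rather than reinserted on the right.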


Table 2-77: Shift Left Quadword by Bytes 





Return/Argument Types                                                               Specific Intrinsics           Assembly Mapping
d                           a                           count
vector unsigned char        vector unsigned char        unsigned int (literal)      d = si_shlqbyi(a, count)      SHLQBYI d, a, count
vector signed char          vector signed char                                      (count = 7-bit immediate)
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double

vector unsigned char        vector unsigned char        unsigned int (non-literal)  d = si_shlqby(a, count)       SHLQBY d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double





spu_slqwbytebc: Shift Left Quadword by Bytes From Bit Shift Count 
d = spu_slqwbytebc(a, count) 


Vector a is shifted left by the number of bytes specified by bits 24-28 of count. Bytes shifted out of the left end of 
the vector are discarded, and zeros are shifted in at the right. The result is returned in vector d. 


Table 2-78: Shift Left Quadword by Bytes From Bit Shift Count 











Return/Argument Types                                                               Specific Intrinsics           Assembly Mapping
d                           a                           count
vector unsigned char        vector unsigned char        unsigned int                d = si_shlqbybi(a, count)     SHLQBYBI d, a, count
vector signed char          vector signed char
vector unsigned short       vector unsigned short
vector signed short         vector signed short
vector unsigned int         vector unsigned int
vector signed int           vector signed int
vector unsigned long long   vector unsigned long long
vector signed long long     vector signed long long
vector float                vector float
vector double               vector double



2.11. Control Intrinsics


spu_idisable: Disable Interrupts 


(void) spu_idisable() 


Asynchronous interrupts are disabled. 


Programming Note: This intrinsic is considered volatile with respect to all other instructions; thus, the BID
instruction will not be reordered with any other instructions.


Table 2-79: Disable Interrupts 





Specific Intrinsics   Assembly Mapping
N/A                   position dependent:
                          ILA  t, next_inst
                          BID  t
                      next_inst:

                      position independent:
                          BRSL t, next_inst
                      next_inst:
                          AI   t, t, 8
                          BID  t





spu_ienable: Enable Interrupts 


(void) spu_ienable() 


Asynchronous interrupts are enabled. 


Programming Note: This intrinsic is considered volatile with respect to all other instructions; thus, the BIE
instruction will not be reordered with any other instructions.


Table 2-80: Enable Interrupts 





Specific Intrinsics   Assembly Mapping
N/A                   position dependent:
                          ILA  t, next_inst
                          BIE  t
                      next_inst:

                      position independent:
                          BRSL t, next_inst
                      next_inst:
                          AI   t, t, 8
                          BIE  t



spu_mffpscr: Move From Floating-Point Status and Control Register


d = spu_mffpscr()


The floating-point status and control register (FPSCR) Special Purpose Register is read, and the contents are 
returned in d. Unused bits of the FPSCR are forced to zero. 


Programming Note: This intrinsic is considered volatile with respect to the floating-point instructions and will not be 
reordered with respect to these instructions. The floating-point instructions include: cflts, cfltu, csflt, cuflt, 
dfa, dfm, dfma, dfms, dfnma, dfnms, dfs, fa, fceq, fcgt, fcmeq, fcmgt, fesd, fi, fm, fma, fms, fnms, 
frds, frest, frsqest, and fscrwr. 


Table 2-81: Move From Floating-Point Status and Control Register 











Return/Argument Types       Specific Intrinsics   Assembly Mapping
d
vector unsigned int         d = si_fscrrd()       FSCRRD d





spu_mfspr: Move From Special Purpose Register 
d = spu_mfspr(register)
The Special Purpose Register specified by enumeration constant register is read, and the contents are returned
in d.


Table 2-82: Move From Special Purpose Register 





Return/Argument Types                       Specific Intrinsics                     Assembly Mapping
d                   register
unsigned int        enumeration             d = si_to_uint(si_mfspr(register))      MFSPR d, register





spu_mtfpscr: Move to Floating-Point Status and Control Register 
(void) spu_mtfpscr(a)
The argument a is written to the floating-point status and control register (FPSCR). 


Programming Note: This intrinsic is considered volatile with respect to the floating-point instructions, and it will not 
be reordered with respect to these instructions. 


Table 2-83: Move to Floating-Point Status and Control Register 





Return/Argument Types       Specific Intrinsics   Assembly Mapping
a
vector unsigned int         si_fscrwr(a)          FSCRWR rt¹, a


¹ The false target parameter rt is optimally chosen depending on register usage of neighboring instructions.


spu_mtspr: Move to Special Purpose Register 
(void) spu_mtspr(register, a) 


The argument a is written to the Special Purpose Register specified by the enumeration constant register. 


Table 2-84: Move to Special Purpose Register 











Return/Argument Types                       Specific Intrinsics                     Assembly Mapping
register            a
enumeration         unsigned int            si_mtspr(register, si_from_uint(a))     MTSPR register, a



spu_dsync: Synchronize Data
(void) spu_dsync()
All earlier store instructions are forced to complete before proceeding. This function ensures that all stores to local
storage are visible to the MFC or PPU.


Programming Note: This intrinsic is considered volatile with respect to the store and MFC write instructions, and it
will not be reordered with respect to these instructions. The store and MFC instructions include: stqa, stqd, stqr,
stqx, and wrch.


Table 2-85: Synchronize Data

Specific Intrinsics    Assembly Mapping
si_dsync()             DSYNC


spu_stop: Stop and Signal 
(void) spu_stop(type)


Execution of the SPU program is stopped. The address of the stop instruction is placed into the least significant 
bits of the SPU NPC register. The signal type is written to the SPU status register, and the PPU is interrupted. 


Programming Note: This intrinsic is considered volatile with respect to all instructions, and it will not be reordered 
with any other instructions. 


Table 2-86: Stop and Signal

type                             Specific Intrinsics    Assembly Mapping
unsigned int (14-bit literal)    si_stop(type)          STOP type





spu_sync: Synchronize 
(void) spu_sync() 


(void) spu_sync_c() 


The processor waits until all pending store instructions have been completed before fetching the next sequential 
instruction. The spu_sync_c form of the intrinsic also performs channel synchronization prior to the instruction 
synchronization. This operation must be used following a store instruction that modifies the instruction stream. 


Programming Note: These synchronization intrinsics are considered volatile with respect to all instructions, and 
they will not be reordered with any other instructions. 


Table 2-87: Synchronize

Generic Intrinsic Form    Specific Intrinsics    Assembly Mapping
spu_sync                  si_sync()              SYNC
spu_sync_c                si_syncc()             SYNCC





2.12. Channel Control Intrinsics 


The channel control intrinsics each take a channel number as an input. Channel numbers are literal unsigned 
integer values in the range from 0 to 127. Table 2-88 and Table 2-89 show the respective SPU and MFC channel 
numbers and their associated mnemonics. For additional details on the channels, see the Cell Broadband Engine™ 
Architecture. 


Programming Note: The channel intrinsics must never be reordered with respect to other channel commands or 
volatile local-storage memory accesses. 
Table 2-88: SPU Channel Numbers¹

Channel Number  Mnemonic            Description
0               SPU_RdEventStat     Read event status with mask applied.
1               SPU_WrEventMask     Write event mask.
2               SPU_WrEventAck      Write end of event processing.
3               SPU_RdSigNotify1    Signal notification 1.
4               SPU_RdSigNotify2    Signal notification 2.
7               SPU_WrDec           Write decrementer count.
8               SPU_RdDec           Read decrementer count.
11              SPU_RdEventMask     Read event mask.
13              SPU_RdMachStat      Read SPU run status.
14              SPU_WrSRR0          Write SPU machine state save/restore register 0 (SRR0).
15              SPU_RdSRR0          Read SPU machine state save/restore register 0 (SRR0).
28              SPU_WrOutMbox       Write outbound mailbox contents.
29              SPU_RdInMbox        Read inbound mailbox contents.
30              SPU_WrOutIntrMbox   Write outbound interrupt mailbox contents (interrupting PPU).

¹ Channel enumerants are defined in spu_intrinsics.h.


Table 2-89: MFC Channel Numbers¹

Channel Number  Mnemonic              Description
9               MFC_WrMSSyncReq       Write multisource synchronization request.
12              MFC_RdTagMask         Read tag mask.
16              MFC_LSA               Write local memory address command parameter.
17              MFC_EAH               Write high order DMA effective address command parameter.
18              MFC_EAL               Write low order DMA effective address command parameter.
19              MFC_Size              Write DMA transfer size command parameter.
20              MFC_TagID             Write tag identifier command parameter.
21              MFC_Cmd               Write and enqueue DMA command with associated class ID.
22              MFC_WrTagMask         Write tag mask.
23              MFC_WrTagUpdate       Write request for conditional/unconditional tag status update.
24              MFC_RdTagStat         Read tag status with mask applied.
25              MFC_RdListStallStat   Read DMA list stall-and-notify status.
26              MFC_WrListStallAck    Write DMA list stall-and-notify acknowledge.
27              MFC_RdAtomicStat      Read completion status of last completed immediate MFC atomic
                                      update command.

¹ The MFC channels are only valid for SPUs within a CBEA-compliant system. MFC channel enumerants are defined in
spu_intrinsics.h.


spu_readch: Read Word Channel 


d = spu_readch(channel)


The word channel that is specified by channel is read, and the contents are placed in d. If the channel does not
exist, a value of zero is returned.


Table 2-90: Read Word Channel

Return/Argument Types        Specific Intrinsics                  Assembly Mapping
d             channel
unsigned int  enumeration    d = si_to_uint(si_rdch(channel))     RDCH d, channel
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spu_readchqw: Read Quadword Channel 
d = spu_readchqw (channel) 


The quadword channel that is specified by channel is read, and the contents are placed in vector d. If the channel 
does not exist, a value of zero is returned. 


Table 2-91: Read Quadword Channel

Return/Argument Types               Specific Intrinsics     Assembly Mapping
d                    channel
vector unsigned int  enumeration    d = si_rdch(channel)    RDCH d, channel





spu_readchcnt: Read Channel Count
d = spu_readchcnt(channel)
A Read Count operation is performed on the channel that is specified by channel, and the count is placed in d. If
the channel does not exist, a value of zero is returned in d.


Table 2-92: Read Channel Count

Return/Argument Types        Specific Intrinsics        Assembly Mapping
d             channel
unsigned int  enumeration    d = si_rchcnt(channel)     RCHCNT d, channel





spu_writech: Write Word Channel 


(void) spu_writech(channel, a) 
The contents of scalar a are written to the channel that is specified by the enumeration constant channel. 
Table 2-93: Write Word Channel

Return/Argument Types         Specific Intrinsics                   Assembly Mapping
channel       a
enumeration   int             si_wrch(channel, si_from_int(a))      WRCH channel, a
              unsigned int    si_wrch(channel, si_from_uint(a))





spu_writechqw: Write Quadword Channel 
(void) spu_writechqw(channel, a) 


The contents of vector a are written to the channel that is specified by the enumeration constant channel. 


Table 2-94: Write Quadword Channel

Return/Argument Types                       Specific Intrinsics    Assembly Mapping
channel       a
enumeration   vector unsigned char          si_wrch(channel, a)    WRCH channel, a
              vector signed char
              vector unsigned short
              vector signed short
              vector unsigned int
              vector signed int
              vector unsigned long long
              vector signed long long
              vector float
              vector double
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2.13. Scalar Intrinsics 


All of the previous intrinsic functions perform operations only on vector data types. This section describes special 
utility intrinsics that allow programmers to efficiently coerce scalars to vectors, or vectors to scalars. With the aid of 
these intrinsics, programmers can use intrinsic functions to perform operations between vectors and scalars without 
having to revert to assembly language. This is especially important when there is a need to perform an operation
that cannot be conveniently expressed in C, such as shuffling bytes. 


spu_extract: Extract Vector Element From Vector 


d = spu_extract (a, element) 


The element that is specified by element is extracted from vector a and returned in d. Depending on the size of the 
element, only a limited number of the least significant bits of the element index are used. For 1-, 2-, 4-, and 8-byte 
elements, only 4, 3, 2, and 1 of the least significant bits of the element index are used, respectively. 


Table 2-95: Extract Vector Element From Vector

Return/Argument Types                                                   Specific      Assembly Mapping¹
d                    a                           element                Intrinsics
unsigned char        vector unsigned char        int (non-literal)      N/A           ROTQBY d, a, element; ROTMI d, d, -24
signed char          vector signed char          int (non-literal)      N/A           ROTQBY d, a, element; ROTMAI d, d, -24
unsigned short       vector unsigned short       int (non-literal)      N/A           SHLI t, element, 1; ROTQBY d, a, t; ROTMI d, d, -16
signed short         vector signed short         int (non-literal)      N/A           SHLI t, element, 1; ROTQBY d, a, t; ROTMAI d, d, -16
unsigned int         vector unsigned int         int (non-literal)      N/A           SHLI t, element, 2; ROTQBY d, a, t
signed int           vector signed int           int (non-literal)      N/A           SHLI t, element, 2; ROTQBY d, a, t
unsigned long long   vector unsigned long long   int (non-literal)      N/A           SHLI t, element, 3; ROTQBY d, a, t
signed long long     vector signed long long     int (non-literal)      N/A           SHLI t, element, 3; ROTQBY d, a, t
float                vector float                int (non-literal)      N/A           SHLI t, element, 2; ROTQBY d, a, t
double               vector double               int (non-literal)      N/A           SHLI t, element, 3; ROTQBY d, a, t
unsigned char        vector unsigned char        int (literal)          N/A           ROTQBYI d, a, element-3
signed char          vector signed char          int (literal)          N/A           ROTQBYI d, a, element-3
unsigned short       vector unsigned short       int (literal)          N/A           ROTQBYI d, a, 2*(element-1)
signed short         vector signed short         int (literal)          N/A           ROTQBYI d, a, 2*(element-1)
unsigned int         vector unsigned int         int (literal)          N/A           ROTQBYI d, a, 4*element
signed int           vector signed int           int (literal)          N/A           ROTQBYI d, a, 4*element
unsigned long long   vector unsigned long long   int (literal)          N/A           ROTQBYI d, a, 8*element
signed long long     vector signed long long     int (literal)          N/A           ROTQBYI d, a, 8*element
float                vector float                int (literal)          N/A           ROTQBYI d, a, 4*element
double               vector double               int (literal)          N/A           ROTQBYI d, a, 8*element

¹ If the specified element is a known value (literal) and specifies the preferred (scalar) element, no instructions are
produced. For 1-byte elements, the scalar element is 3. For 2-byte elements, the scalar element is 1. For 4- and 8-byte
elements, the scalar element is 0. Sign extension may still be performed if a subsequent operation requires the
resulting scalar to be cast to a larger data type. This sign extension may be deferred until the subsequent operation.
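The element-index rule stated for spu_extract (only 4, 3, 2, or 1 low-order bits of the index are used for 1-, 2-, 4-, and
8-byte elements) can be modeled on a host machine. The following is a minimal sketch in portable C, not SPU code. The
function names elements_per_vector and effective_element are hypothetical and are not part of this specification; the
sketch assumes the 16-byte SPU vector width.

```c
#include <stddef.h>

/* Hypothetical helper: number of elements in a 16-byte vector
 * for a given element size in bytes (1, 2, 4, or 8). */
static unsigned elements_per_vector(size_t element_size)
{
    return (unsigned)(16 / element_size);
}

/* Hypothetical helper: the element index that is actually used.
 * Masking with (count - 1) keeps 4, 3, 2, or 1 low-order bits
 * for 1-, 2-, 4-, or 8-byte elements, respectively. */
static unsigned effective_element(int element, size_t element_size)
{
    return (unsigned)element & (elements_per_vector(element_size) - 1u);
}
```

Under this model, an index of 5 applied to a vector of 4-byte elements selects element 1, because only the two
low-order bits of the index are used.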
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spu_insert: Insert Scalar into Specified Vector Element 


d = spu_insert(a, b, element) 


Scalar a is inserted into the element of vector b that is specified by the element parameter, and the modified
vector is returned. All other elements of b are unmodified. Depending on the size of the element, only a limited
number of the least significant bits of the element index are used. For 1-, 2-, 4-, and 8-byte elements, only 4, 3, 2,
and 1 of the least significant bits of the element index are used, respectively.


Table 2-96: Insert Scalar into Specified Vector Element

Return/Argument Types                                                                            Specific     Assembly Mapping¹
d                           a                    b                           element             Intrinsics
vector unsigned char        unsigned char        vector unsigned char        int (non-literal)   N/A          CBD t, 0(element); SHUFB d, a, b, t
vector signed char          signed char          vector signed char          int (non-literal)   N/A          CBD t, 0(element); SHUFB d, a, b, t
vector unsigned short       unsigned short       vector unsigned short       int (non-literal)   N/A          SHLI t, element, 1; CHD t, 0(t); SHUFB d, a, b, t
vector signed short         signed short         vector signed short         int (non-literal)   N/A          SHLI t, element, 1; CHD t, 0(t); SHUFB d, a, b, t
vector unsigned int         unsigned int         vector unsigned int         int (non-literal)   N/A          SHLI t, element, 2; CWD t, 0(t); SHUFB d, a, b, t
vector signed int           signed int           vector signed int           int (non-literal)   N/A          SHLI t, element, 2; CWD t, 0(t); SHUFB d, a, b, t
vector float                float                vector float                int (non-literal)   N/A          SHLI t, element, 2; CWD t, 0(t); SHUFB d, a, b, t
vector unsigned long long   unsigned long long   vector unsigned long long   int (non-literal)   N/A          SHLI t, element, 3; CDD t, 0(t); SHUFB d, a, b, t
vector signed long long     signed long long     vector signed long long     int (non-literal)   N/A          SHLI t, element, 3; CDD t, 0(t); SHUFB d, a, b, t
vector double               double               vector double               int (non-literal)   N/A          SHLI t, element, 3; CDD t, 0(t); SHUFB d, a, b, t
vector unsigned char        unsigned char        vector unsigned char        int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat
vector signed char          signed char          vector signed char          int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat
vector unsigned short       unsigned short       vector unsigned short       int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat
vector signed short         signed short         vector signed short         int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat
vector unsigned int         unsigned int         vector unsigned int         int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat
vector signed int           signed int           vector signed int           int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat
vector float                float                vector float                int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat
vector unsigned long long   unsigned long long   vector unsigned long long   int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat
vector signed long long     signed long long     vector signed long long     int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat
vector double               double               vector double               int (literal)       N/A          LQD pat, CONST_AREA; SHUFB d, a, b, pat

¹ If the specified element is a known value (literal), a shuffle pattern can be loaded from the constant area. The
contents of the pattern depend on the size of the element and the element being replaced.
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spu_promote: Promote Scalar to a Vector 


d = spu_promote (a, element) 
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Scalar a is promoted to a vector containing a in the element that is specified by the element parameter, and the 
vector is returned in d. All other elements of the vector are undefined. Depending on the size of the element/scalar, 
only a limited number of the least significant bits of the element index are used. For 1-, 2-, 4-, and 8-byte elements, 
only 4, 3, 2, and 1 of the least significant bits of the element index are used, respectively. 


Table 2-97: Promote Scalar to a Vector

Return/Argument Types                                                    Specific     Assembly Mapping¹
d                           a                    element                 Intrinsics
vector unsigned char        unsigned char        int (non-literal)       N/A          SFI t, element, 3; ROTQBY d, a, t
vector signed char          signed char          int (non-literal)       N/A          SFI t, element, 3; ROTQBY d, a, t
vector unsigned short       unsigned short       int (non-literal)       N/A          SFI t, element, 1; SHLI t, t, 1; ROTQBY d, a, t
vector signed short         signed short         int (non-literal)       N/A          SFI t, element, 1; SHLI t, t, 1; ROTQBY d, a, t
vector unsigned int         unsigned int         int (non-literal)       N/A          SFI t, element, 0; SHLI t, t, 2; ROTQBY d, a, t
vector signed int           signed int           int (non-literal)       N/A          SFI t, element, 0; SHLI t, t, 2; ROTQBY d, a, t
vector float                float                int (non-literal)       N/A          SFI t, element, 0; SHLI t, t, 2; ROTQBY d, a, t
vector unsigned long long   unsigned long long   int (non-literal)       N/A          SHLI t, element, 3; ROTQBY d, a, t
vector signed long long     signed long long     int (non-literal)       N/A          SHLI t, element, 3; ROTQBY d, a, t
vector double               double               int (non-literal)       N/A          SHLI t, element, 3; ROTQBY d, a, t
vector unsigned char        unsigned char        int (literal)           N/A          ROTQBYI d, a, (3-element)
vector signed char          signed char          int (literal)           N/A          ROTQBYI d, a, (3-element)
vector unsigned short       unsigned short       int (literal)           N/A          ROTQBYI d, a, 2*(1-element)
vector signed short         signed short         int (literal)           N/A          ROTQBYI d, a, 2*(1-element)
vector unsigned int         unsigned int         int (literal)           N/A          ROTQBYI d, a, -4*element
vector signed int           signed int           int (literal)           N/A          ROTQBYI d, a, -4*element
vector float                float                int (literal)           N/A          ROTQBYI d, a, -4*element
vector unsigned long long   unsigned long long   int (literal)           N/A          ROTQBYI d, a, -8*element
vector signed long long     signed long long     int (literal)           N/A          ROTQBYI d, a, -8*element
vector double               double               int (literal)           N/A          ROTQBYI d, a, -8*element

¹ If the specified element is of known value (literal) and specifies the preferred (scalar) element, no instructions are
produced. For 1-byte elements, the scalar element is 3. For 2-byte elements, the scalar element is 1. For 4- and 8-byte
elements, the scalar element is 0.
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3. Composite Intrinsics 


This chapter describes several composite intrinsics that have practical use for a wide variety of SPU programs. 
Composite intrinsics are those intrinsics that can be constructed from a series of low-level intrinsics. In this context, 
“low-level” means generic or specific. Because of the complexity of these operations, frequency of use, and 
scheduling constraints, the particular services are provided as intrinsics. 


Composite intrinsics are DMA intrinsics. The DMA intrinsics rely heavily on the channel control intrinsics. 


spu_mfcdma32: Initiate DMA To/From 32-Bit Effective Address


spu_mfcdma32(ls, ea, size, tagid, cmd)


A DMA transfer of size bytes is initiated from local storage to system memory or from system memory to local
storage. The effective address that is specified by ea is a 32-bit virtual memory address. The local-storage address
is specified by the ls parameter. The DMA request is issued using the specified tagid. The type and direction of
DMA, bandwidth reservation, and class ID are encoded in the cmd parameter. For additional details about the
commands and restrictions on the size of supported DMA operations, see the Cell Broadband Engine™ Architecture.


Table 3-98: Initiate DMA To/From 32-Bit Effective Address

Return/Argument Types                                                      Assembly Mapping
ls               ea            size          tagid         cmd
volatile void *  unsigned int  unsigned int  unsigned int  unsigned int    spu_writech(MFC_LSA, ls)
                                                                           spu_writech(MFC_EAL, ea)
                                                                           spu_writech(MFC_Size, size)
                                                                           spu_writech(MFC_TagID, tagid)
                                                                           spu_writech(MFC_Cmd, cmd)





spu_mfcdma64: Initiate DMA To/From 64-Bit Effective Address


spu_mfcdma64(ls, eahi, ealow, size, tagid, cmd)


A DMA transfer of size bytes is initiated from local storage to system memory or from system memory to local storage.
The effective address that is specified by the concatenation of eahi and ealow is a 64-bit virtual memory address. The
local-storage address is specified by the ls parameter. The DMA request is issued using the specified tagid. The
type and direction of DMA, bandwidth reservation, and class ID are encoded in the cmd parameter. For additional
details about the commands and restrictions on the size of supported DMA operations, see the Cell Broadband
Engine™ Architecture.


Table 3-99: Initiate DMA To/From 64-Bit Effective Address

Return/Argument Types                                                                    Assembly Mapping
ls               eahi          ealow         size          tagid         cmd
volatile void *  unsigned int  unsigned int  unsigned int  unsigned int  unsigned int    spu_writech(MFC_LSA, ls)
                                                                                         spu_writech(MFC_EAH, eahi)
                                                                                         spu_writech(MFC_EAL, ealow)
                                                                                         spu_writech(MFC_Size, size)
                                                                                         spu_writech(MFC_TagID, tagid)
                                                                                         spu_writech(MFC_Cmd, cmd)
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spu_mfcstat: Read MFC Tag Status 
d = spu_mfcstat (type) 
The current MFC tag status is read and logically ANDed with the current tag mask, and the result is returned in d. 
The type of read to be performed is specified by the type parameter. If the type is 0, the function reads and 
immediately returns the current MFC tag status. If the type is 1, the function reads and blocks for any outstanding 
MFC tags to complete, and if the type is 2, the function reads and blocks for all outstanding MFC tags to complete. 
Table 3-100: Read MFC Tag Status

Return/Argument Types         Assembly Mapping
d             type
unsigned int  unsigned int    spu_writech(MFC_WrTagUpdate, type)
                              d = spu_readch(MFC_RdTagStat)
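The three type values described for spu_mfcstat can be illustrated with a small host-side model. The following is a
sketch in portable C, not SPU code. It assumes that a set bit in the tag status means that the corresponding tag group
has no outstanding commands; that reading of the status bits, and the names TAG_UPDATE_IMMEDIATE, TAG_UPDATE_ANY,
TAG_UPDATE_ALL, masked_status, and read_would_return, are assumptions made for illustration and are not defined by
this specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three type values described in the text. */
enum {
    TAG_UPDATE_IMMEDIATE = 0, /* read and return at once                     */
    TAG_UPDATE_ANY       = 1, /* block until any masked tag group completes  */
    TAG_UPDATE_ALL       = 2  /* block until all masked tag groups complete  */
};

/* Tag status as returned in d: the status ANDed with the tag mask. */
static uint32_t masked_status(uint32_t completed, uint32_t mask)
{
    return completed & mask;
}

/* Model of the blocking rule: true if the read would return now,
 * false if it would keep waiting. */
static bool read_would_return(uint32_t completed, uint32_t mask, unsigned type)
{
    uint32_t status = masked_status(completed, mask);

    switch (type) {
    case TAG_UPDATE_IMMEDIATE: return true;
    case TAG_UPDATE_ANY:       return status != 0;
    case TAG_UPDATE_ALL:       return status == mask;
    default:                   return false; /* not a type described in the text */
    }
}
```

With a tag mask of 0x6 and only tag group 1 complete, this model returns for types 0 and 1 and continues to wait for
type 2.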





C/C++ Language Extensions for Cell Broadband Engine™ Architecture, Version 2.3 







4. Programming Support for MFC Input and Output 


Several MFC utility functions are described in this chapter. These functions may be provided as a programming 
convenience; none of them are required. The functions that are described can be implemented either as macro 
definitions or as built-in functions within the compiler. To access these functions, programmers must include the 
header file spu_mfcio.h. 


For each function listed in the sections below, the function usage is shown, followed by a brief description and the 
function implementation. 


4.1. Structures 


A principal data structure is the MFC List DMA. The elements in this list are described below. 


mfc_list_element: DMA List Element for MFC List DMA 
typedef struct mfc_list_element {
    uint64_t notify   :  1;
    uint64_t reserved : 16;
    uint64_t size     : 15;
    uint64_t eal      : 32;
} mfc_list_element_t;


The mfc_list_element is an element in the array MFC List DMA. The structure comprises several bit-fields:
notify is the stall-and-notify bit, reserved is set to zero, size is the list element transfer size, and eal is the low
word of the 64-bit effective address.
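The field widths of the list element (1, 16, 15, and 32 bits) add up to one 64-bit word. The following sketch packs the
fields into a single integer with explicit shifts so that the layout can be examined on any host. It assumes that the
first-declared bit-field occupies the most significant bits, as is usual for a big-endian target; that ordering and the
helper name pack_list_element are assumptions for illustration, not part of this specification.

```c
#include <stdint.h>

/* Hypothetical helper: builds the 64-bit image of one list element.
 * Assumed layout, most significant bit first:
 *   notify (1 bit) | reserved (16 bits) | size (15 bits) | eal (32 bits) */
static uint64_t pack_list_element(unsigned notify, uint32_t size, uint32_t eal)
{
    uint64_t word = 0;

    word |= (uint64_t)(notify & 0x1u) << 63;   /* stall-and-notify bit      */
    /* reserved field (bits 62..47) is left as zero                         */
    word |= (uint64_t)(size & 0x7FFFu) << 32;  /* 15-bit transfer size      */
    word |= (uint64_t)eal;                     /* low word of the address   */
    return word;
}
```

For example, a 128-byte transfer to a low address word of 0x1000 without stall-and-notify packs to
0x0000008000001000 under the assumed layout.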


4.2. Effective Address Utilities 


A frequent requirement for MFC programming is to manipulate effective addresses. This section describes several 
functions for performing the most common operations. 


mfc_ea2h: Extract Higher 32 Bits From Effective Address 
(uint32_t) mfc_ea2h(uint64_t ea)


The higher 32 bits are extracted from the 64-bit effective address ea.


Implementation
(uint32_t)((uint64_t)(ea) >> 32)


mfc_ea2l: Extract Lower 32 Bits From Effective Address 
(uint32_t) mfc_ea2l(uint64_t ea)


The lower 32 bits are extracted from the 64-bit effective address ea. 


Implementation 
(uint32_t) (ea) 
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mfc_hl2ea: Concatenate Higher 32 Bits and Lower 32 Bits 
(uint64_t) mfc_hl2ea(uint32_t high, uint32_t low)


The higher 32 bits of a 64-bit address high and the lower 32 bits low are concatenated.


Implementation
si_to_ullong(si_selb(si_from_uint(high),
             si_rotqbyi(si_from_uint(low), -4), si_fsmbi(0x0f0f)))
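The split and concatenation performed by mfc_ea2h, mfc_ea2l, and mfc_hl2ea can be expressed with ordinary shifts.
The following is a minimal sketch in portable C that shows the arithmetic only; it does not use the SPU intrinsics of the
implementation shown above, and the names ea_high, ea_low, and ea_join are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: upper 32 bits of a 64-bit effective address. */
static uint32_t ea_high(uint64_t ea)
{
    return (uint32_t)(ea >> 32);
}

/* Hypothetical helper: lower 32 bits of a 64-bit effective address. */
static uint32_t ea_low(uint64_t ea)
{
    return (uint32_t)ea;
}

/* Hypothetical helper: concatenates the two halves again. */
static uint64_t ea_join(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | (uint64_t)low;
}
```

Joining the two halves of any address reproduces the original value, that is, ea_join(ea_high(ea), ea_low(ea)) equals ea.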


mfc_ceil128: Round Up Value to Next Multiple of 128 


(uint32_t) mfc_ceil128(uint32_t value)
(uint64_t) mfc_ceil128(uint64_t value)
(uintptr_t) mfc_ceil128(uintptr_t value)


The argument value is rounded to the next higher multiple of 128. 


Implementation 
(value + 127) & ~127 


Example 


volatile char buf[256]; 
volatile void *ptr = (volatile void *)mfc_ceil128((uintptr_t)buf);
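The rounding expression can be checked on a host with a few sample values. The following is a minimal sketch for the
uintptr_t form only; the function name ceil128 is hypothetical and stands in for mfc_ceil128, which is provided by
spu_mfcio.h on the SPU.

```c
#include <stdint.h>

/* Hypothetical stand-in for the uintptr_t form of mfc_ceil128:
 * adds 127 and then clears the seven low-order bits. */
static uintptr_t ceil128(uintptr_t value)
{
    return (value + (uintptr_t)127) & ~(uintptr_t)127;
}
```

A value that is already a multiple of 128 is returned unchanged. In the example above, aligning the start of a 256-byte
buffer skips at most 127 bytes, so at least 129 bytes remain after the aligned pointer, which is enough for one
128-byte block.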


4.3. MFC DMA Commands 


This section describes functions that implement the various MFC DMA commands. See the Cell Broadband
Engine™ Architecture for a description of the DMA commands, including restrictions on the size of the supported
operations.


MFC DMA command mnemonics are listed in Table 4-101. 


Table 4-101: MFC DMA Command Mnemonics¹

Mnemonic        Opcode    Command
MFC_PUT_CMD     0x0020    put
MFC_PUTB_CMD    0x0021    putb
MFC_PUTF_CMD    0x0022    putf
MFC_GET_CMD     0x0040    get
MFC_GETB_CMD    0x0041    getb
MFC_GETF_CMD    0x0042    getf

¹ MFC command enumerants are defined in spu_mfcio.h.


mfc_put: Move Data From Local Storage to Effective Address 
(void) mfc_put(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag,
               uint32_t tid, uint32_t rid)
Data is moved from local storage to system memory. The arguments to this function correspond to the arguments of
the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory,
size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement
class identifier.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), size, tag,
             ((tid<<24)|(rid<<16)|MFC_PUT_CMD))
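The last argument in the implementation above combines the transfer class identifier, the replacement class
identifier, and the command opcode into one word. The following is a minimal host-side sketch of that encoding; the
function name dma_cmd_word is hypothetical, and the opcode values passed to it are the ones listed in Table 4-101.

```c
#include <stdint.h>

/* Hypothetical helper: builds the cmd argument in the same way as the
 * implementation expression (tid<<24)|(rid<<16)|opcode. */
static uint32_t dma_cmd_word(uint32_t tid, uint32_t rid, uint32_t opcode)
{
    return (tid << 24) | (rid << 16) | opcode;
}
```

With tid and rid both zero, the word is simply the opcode, for example 0x0020 for put.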
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mfc_putb: Move Data From Local Storage to Effective Address with Barrier 


(void) mfc_putb(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag,
                uint32_t tid, uint32_t rid)


Data is moved from local storage to system memory. The arguments to this function correspond to the arguments of
the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory,
size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement
class identifier. This command and all subsequent commands with the same tag ID as this command are locally
ordered with respect to all previously issued commands within the same tag group and command queue.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), size, tag,
             ((tid<<24)|(rid<<16)|MFC_PUTB_CMD))


mfc_putf: Move Data From Local Storage to Effective Address with Fence 


(void) mfc_putf(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag,
                uint32_t tid, uint32_t rid)


Data is moved from local storage to system memory. The arguments to this function correspond to the arguments of
the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory,
size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement
class identifier. This command is locally ordered with respect to all previously issued commands within the same tag
group and command queue.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), size, tag,
             ((tid<<24)|(rid<<16)|MFC_PUTF_CMD))


mfc_get: Move Data From Effective Address to Local Storage 


(void) mfc_get(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag,
               uint32_t tid, uint32_t rid)


Data is moved from system memory to local storage. The arguments to this function correspond to the arguments of
the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory,
size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement
class identifier.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), size, tag,
             ((tid<<24)|(rid<<16)|MFC_GET_CMD))





mfc_getf: Move Data From Effective Address to Local Storage with Fence 


(void) mfc_getf(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag,
                uint32_t tid, uint32_t rid)


Data is moved from system memory to local storage. The arguments to this function correspond to the arguments of
the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory,
size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement
class identifier. This command is locally ordered with respect to all previously issued commands within the same tag
group and command queue.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), size, tag,
             ((tid<<24)|(rid<<16)|MFC_GETF_CMD))
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mfc_getb: Move Data From Effective Address to Local Storage with Barrier 


(void) mfc_getb(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag,
                uint32_t tid, uint32_t rid)


Data is moved from system memory to local storage. The arguments to this function correspond to the arguments of
the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in system memory,
size is the DMA transfer size, tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement
class identifier. This command and all subsequent commands with the same tag ID as this command are locally
ordered with respect to all previously issued commands within the same tag group and command queue.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), size, tag,
             ((tid<<24)|(rid<<16)|MFC_GETB_CMD))





4.4. MFC List DMA Commands 


This section describes utility functions that can be used to manage the MFC List DMA. See the Cell Broadband 
Engine™ Architecture for a description of the DMA commands, including restrictions on the size of the supported 
operations. 


MFC List DMA command mnemonics are listed in Table 4-102. 


Table 4-102: MFC List DMA Command Mnemonics¹

Mnemonic         Opcode    Command
MFC_PUTL_CMD     0x0024    putl
MFC_PUTLB_CMD    0x0025    putlb
MFC_PUTLF_CMD    0x0026    putlf
MFC_GETL_CMD     0x0044    getl
MFC_GETLB_CMD    0x0045    getlb
MFC_GETLF_CMD    0x0046    getlf

¹ MFC command enumerants are defined in spu_mfcio.h.
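Comparing the opcodes in Table 4-101 and Table 4-102 suggests a regular pattern: each barrier, fence, and list variant
differs from the base put or get opcode by a single bit. The following sketch records that observation. The modifier
names CMD_BARRIER, CMD_FENCE, and CMD_LIST and the function cmd_opcode are hypothetical; their values are inferred from
the two tables, and they are not enumerants defined by this specification.

```c
/* Base opcodes as listed in Table 4-101. */
enum { CMD_PUT = 0x0020, CMD_GET = 0x0040 };

/* Hypothetical modifier bits, inferred by comparing the listed opcodes. */
enum { CMD_BARRIER = 0x0001, CMD_FENCE = 0x0002, CMD_LIST = 0x0004 };

/* Hypothetical helper: combines a base opcode with modifier bits. */
static unsigned cmd_opcode(unsigned base, unsigned modifiers)
{
    return base | modifiers;
}
```

For example, combining the get opcode with the list and barrier bits yields 0x0045, which matches the value listed for
MFC_GETLB_CMD.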


mfc_putl: Move Data From Local Storage to Effective Address Using MFC List 


(void) mfc_putl(volatile void *ls, uint64_t ea, mfc_list_element_t *list,
                uint32_t list_size, uint32_t tag, uint32_t tid, uint32_t rid)


Data is moved from local storage to system memory using the MFC list. The arguments to this function correspond
to the arguments of the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in
system memory, list is the DMA list address, list_size is the DMA list size, tag is the DMA tag, tid is the
transfer class identifier, and rid is the replacement class identifier.


Implementation
spu_mfcdma64(ls, mfc_ea2h(ea), (unsigned int)(list), list_size, tag,
             ((tid<<24)|(rid<<16)|MFC_PUTL_CMD))


mfc_putlb: Move Data From Local Storage to Effective Address Using MFC List with Barrier 
(void) mfc_putlb(volatile void *ls, uint64_t ea, mfc_list_element_t *list,
                 uint32_t list_size, uint32_t tag, uint32_t tid, uint32_t rid)
Data is moved from local storage to system memory using the MFC list. The arguments to this function correspond
to the arguments of the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in
system memory, list is the DMA list address, list_size is the DMA list size, tag is the DMA tag, tid is the
transfer class identifier, and rid is the replacement class identifier. This command and all subsequent commands
with the same tag ID as this command are locally ordered with respect to all previously issued commands within the
same tag group and command queue.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), (unsigned int)(list), list_size, tag,
             ((tid<<24)|(rid<<16)|MFC_PUTLB_CMD))


mfc_putlf: Move Data From Local Storage to Effective Address Using MFC List with Fence 


(void) mfc_putlf(volatile void *ls, uint64_t ea, mfc_list_element_t *list,
                 uint32_t list_size, uint32_t tag, uint32_t tid, uint32_t rid)


Data is moved from local storage to system memory using the MFC list. The arguments to this function correspond
to the arguments of the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in
system memory, list is the DMA list address, list_size is the DMA list size, tag is the DMA tag, tid is the
transfer class identifier, and rid is the replacement class identifier. This command is locally ordered with respect to
all previously issued commands within the same tag group and command queue.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), (unsigned int)(list), list_size, tag,
             ((tid<<24)|(rid<<16)|MFC_PUTLF_CMD))


mfc_getl: Move Data From Effective Address to Local Storage Using MFC List 


(void) mfc_getl(volatile void *ls, uint64_t ea, mfc_list_element_t *list,
                uint32_t list_size, uint32_t tag, uint32_t tid, uint32_t rid)


Data is moved from system memory to local storage using the MFC list. The arguments to this function correspond
to the arguments of the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in
system memory, list is the DMA list address, list_size is the DMA list size, tag is the DMA tag, tid is the
transfer class identifier, and rid is the replacement class identifier.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), (unsigned int)(list), list_size, tag,
             ((tid<<24)|(rid<<16)|MFC_GETL_CMD))





mfc_getlb: Move Data From Effective Address to Local Storage Using MFC List with Barrier 


(void) mfc_getlb(volatile void *ls, uint64_t ea, mfc_list_element_t *list,
                 uint32_t list_size, uint32_t tag, uint32_t tid, uint32_t rid)


Data is moved from system memory to local storage using the MFC list. The arguments to this function correspond
to the arguments of the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in
system memory, list is the DMA list address, list_size is the DMA list size, tag is the DMA tag, tid is the
transfer class identifier, and rid is the replacement class identifier. This command and all subsequent commands
with the same tag ID as this command are locally ordered with respect to all previously issued commands within the
same tag group and command queue.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), (unsigned int)(list), list_size, tag,
             ((tid<<24)|(rid<<16)|MFC_GETLB_CMD))





mfc_getlf: Move Data From Effective Address to Local Storage Using MFC List with Fence 


(void) mfc_getlf(volatile void *ls, uint64_t ea, mfc_list_element_t *list,
                 uint32_t list_size, uint32_t tag, uint32_t tid, uint32_t rid)


Data is moved from system memory to local storage using the MFC list. The arguments to this function correspond
to the arguments of the spu_mfcdma64 command: ls is the local-storage address, ea is the effective address in
system memory, list is the DMA list address, list_size is the DMA list size, tag is the DMA tag, tid is the
transfer class identifier, and rid is the replacement class identifier. This command is locally ordered with respect to
all previously issued commands within the same tag group and command queue.


Implementation


spu_mfcdma64(ls, mfc_ea2h(ea), (unsigned int)(list), list_size, tag,
             ((tid<<24)|(rid<<16)|MFC_GETLF_CMD))





4.5. MFC Atomic Update Commands 


This section describes utility functions that can be used to manage the MFC Atomic DMA. See the Cell Broadband 
Engine™ Architecture for a description of the DMA commands, including restrictions on the size of the supported 
operations. 


MFC Atomic DMA command mnemonics are listed in Table 4-103. 


Table 4-103: MFC Atomic Update Command Mnemonics¹

Mnemonic            Opcode    Command
MFC_GETLLAR_CMD     0x00D0    getllar
MFC_PUTLLC_CMD      0x00B4    putllc
MFC_PUTLLUC_CMD     0x00B0    putlluc
MFC_PUTQLLUC_CMD    0x00B8    putqlluc

¹ MFC command enumerants are defined in spu_mfcio.h.


mfc_getllar: Get Lock Line and Create Reservation


(void) mfc_getllar(volatile void *ls, uint64_t ea, uint32_t tid, uint32_t rid)


The lock line is obtained and a reservation is created. The arguments to this function correspond to the arguments
of the spu_mfcdma64 command: ls is the 128-byte-aligned local-storage address, ea is the effective address in
system memory, tid is the transfer class identifier, and rid is the replacement class identifier.


The mfc_getllar command does not have a tag ID. The command is immediately executed by the MFC. The
transfer size is fixed at 128 bytes. An mfc_read_atomic_status() call must follow this function to verify completion
of the command.


Implementation 


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), 128, 0,
             ((tid<<24) | (rid<<16) | MFC_GETLLAR_CMD))
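
The Implementation lines in this chapter all build the final spu_mfcdma64 argument the same way, and they split the effective address with mfc_ea2h and mfc_ea2l. The following host-side sketch shows that arithmetic in plain C. The names ea_high, ea_low, and make_cmd_word are hypothetical stand-ins, not part of spu_mfcio.h, and the 8-bit limit on tid and rid is an assumption suggested by the shift amounts rather than something this section states.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode copied from Table 4-103. */
#define MFC_GETLLAR_CMD 0x00D0u

/* Hypothetical stand-ins for mfc_ea2h and mfc_ea2l: split a 64-bit
   effective address into its high and low 32-bit halves. */
static uint32_t ea_high(uint64_t ea) { return (uint32_t)(ea >> 32); }
static uint32_t ea_low(uint64_t ea)  { return (uint32_t)(ea & 0xFFFFFFFFu); }

/* Pack tid, rid, and the opcode as the Implementation lines do:
   tid shifted left by 24, rid shifted left by 16, opcode in the low bits.
   Assumption: tid and rid each fit in 8 bits. */
static uint32_t make_cmd_word(uint32_t tid, uint32_t rid, uint32_t opcode)
{
    assert(tid <= 0xFFu && rid <= 0xFFu && opcode <= 0xFFFFu);
    return (tid << 24) | (rid << 16) | opcode;
}
```

With tid 1 and rid 2, make_cmd_word yields 0x010200D0 for a getllar command, which is the value the expression in the Implementation line above would pass as the last argument.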





mfc_putllc: Put Lock Line if Reservation for Effective Address Exists


(void) mfc_putllc(volatile void *ls, uint64_t ea, uint32_t tid, uint32_t rid)


The lock line is put if a reservation for the effective address exists. The arguments to this function correspond to the
arguments of the spu_mfcdma64 command: ls is the 128-byte-aligned local-storage address, ea is the effective
address in system memory, tid is the transfer class identifier, and rid is the replacement class identifier.


The mfc_putllc command does not have a tag ID and is immediately executed by the MFC. The transfer size is fixed at
128 bytes. An mfc_read_atomic_status() call must follow this command to verify completion of the command.


Implementation 


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), 128, 0,
             ((tid<<24) | (rid<<16) | MFC_PUTLLC_CMD))


mfc_putlluc: Put Lock Line Unconditional


(void) mfc_putlluc(volatile void *ls, uint64_t ea, uint32_t tid, uint32_t rid)


The lock line is put regardless of the existence of a previously made reservation. The arguments to this function 
correspond to the arguments of the spu_mfcdma64 command: ls is the 128-byte-aligned local-storage address,
ea is the effective address in system memory, tid is the transfer class identifier, and rid is the replacement class
identifier.


This command does not have a tag ID and is immediately executed by the MFC. The transfer size is fixed at 128 bytes.
An mfc_read_atomic_status() call must follow this function to verify completion of the command.


Implementation 


spu_mfcdma64(ls,mfc_ea2h(ea),mfc_ea2l(ea), 128, 0, 
((tid<<24) | (rid<<16) |MFC_PUTLLUC_CMD) ) 


mfc_putqlluc: Put Queued Lock Line Unconditional 


(void) mfc_putqlluc(volatile void *ls, uint64_t ea, uint32_t tag, uint32_t tid,
uint32_t rid)


The lock line is put in the queue regardless of the existence of a previously made reservation. The arguments to this
function correspond to the arguments of the spu_mfcdma64 command: ls is the 128-byte-aligned local-storage
address, ea is the effective address in system memory, tag is the DMA tag, tid is the transfer class identifier, and
rid is the replacement class identifier.


Transfer size is fixed at 128 bytes. This command is functionally equivalent to the mfc_putlluc command. The 
difference between the two commands is the order in which the commands are executed and the way that 
completion is determined. mfc_putlluc is performed immediately; in contrast, mfc_putqlluc is placed into the 
MFC command queue, along with other MFC commands. Because this command is queued, it is executed 
independently of any pending immediate mfc_getllar, mfc_putllc, or mfc_putlluc commands. To determine
if this command has been performed, a program must wait for a tag-group completion. 


Implementation 


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), 128, tag,
             ((tid<<24) | (rid<<16) | MFC_PUTQLLUC_CMD))


4.6. MFC Synchronization Commands 


This section describes functions that implement the MFC synchronization commands, including signal notification 
and storage ordering. See the Cell Broadband Engine™ Architecture for a description of the DMA commands, 
including restrictions on the size of the supported operations. 


MFC synchronization command mnemonics are listed in Table 4-104. 


Table 4-104: MFC Synchronization Command Mnemonics¹


Mnemonic           Opcode    Command
MFC_SNDSIG_CMD     0x00A0    sndsig
MFC_SNDSIGB_CMD    0x00A1    sndsigb
MFC_SNDSIGF_CMD    0x00A2    sndsigf
MFC_BARRIER_CMD    0x00C0    barrier
MFC_EIEIO_CMD      0x00C8    mfceieio
MFC_SYNC_CMD       0x00CC    mfcsync


¹ MFC command enumerants are defined in spu_mfcio.h.


mfc_sndsig: Send Signal


(void) mfc_sndsig(volatile void *ls, uint64_t ea, uint32_t tag, uint32_t tid,
uint32_t rid)


An mfc_sndsig command is enqueued into the DMA queue, or is stalled when the DMA queue is full. The
arguments to this function correspond to the arguments of the spu_mfcdma64 command: ls is the local-storage
address, ea is the effective address in system memory, tag is the DMA tag, tid is the transfer class identifier, and
rid is the replacement class identifier. Transfer size is fixed at 4 bytes.


Implementation 


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), 4, tag,
             ((tid<<24) | (rid<<16) | MFC_SNDSIG_CMD))


mfc_sndsigb: Send Signal with Barrier 


(void) mfc_sndsigb(volatile void *ls, uint64_t ea, uint32_t tag, uint32_t tid,
uint32_t rid)


An mfc_sndsigb command is enqueued into the DMA queue, or is stalled when the DMA queue is full. The
arguments to this function correspond to the arguments of the spu_mfcdma64 command: ls is the local-storage
address, ea is the effective address in system memory, tag is the DMA tag, tid is the transfer class identifier, and
rid is the replacement class identifier. Transfer size is fixed at 4 bytes. This command and all subsequent
commands with the same tag ID as this command are locally ordered with respect to all previously issued 
commands within the same tag group and command queue. 


Implementation 


spu_mfcdma64(ls, mfc_ea2h(ea), mfc_ea2l(ea), 4, tag,
             ((tid<<24) | (rid<<16) | MFC_SNDSIGB_CMD))


mfc_sndsigf: Send Signal with Fence 


(void) mfc_sndsigf(volatile void *ls, uint64_t ea, uint32_t tag, uint32_t tid,
uint32_t rid)


An mfc_sndsigf command is enqueued into the DMA queue, or is stalled when the DMA queue is full. The
arguments to this function correspond to the arguments of the spu_mfcdma64 command: ls is the local-storage
address, ea is the effective address in system memory, tag is the DMA tag, tid is the transfer class identifier, and
rid is the replacement class identifier. Transfer size is fixed at 4 bytes. This command is locally ordered with
respect to all previously issued commands within the same tag group and command queue. 


Implementation 


spu_mfcdma64(ls, mfc_ea2h (ea), mfc_ea2l(ea), 4, tag, 
( (tid<<24) | (rid<<16) |MFC_SNDSIGF_CMD) ) 


mfc_barrier: Enqueue mfc_barrier Command into DMA Queue or Stall When Queue is Full


(void) mfc_barrier(uint32_t tag)


An mfc_barrier command is enqueued into the DMA queue, or the command is stalled when the DMA queue is
full. tag is the DMA tag. An mfc_barrier command guarantees that MFC commands preceding the barrier will be
executed before the execution of MFC commands following it, regardless of the tag of preceding or subsequent
MFC commands.


Implementation


spu_mfcdma32(0, 0, 0, tag, MFC_BARRIER_CMD)


mfc_eieio: Enqueue mfc_eieio Command into DMA Queue or Stall When Queue is Full


(void) mfc_eieio(uint32_t tag, uint32_t tid, uint32_t rid)


An mfc_eieio command is enqueued into the DMA queue, or the command is stalled when the DMA queue is full.
tag is the DMA tag, tid is the transfer class identifier, and rid is the replacement class identifier. Do not use this
command to maintain the order of commands immediately inside a single SPE. The mfc_eieio command is
intended for inter-processor and inter-device synchronization. This command creates a large load on the memory system.


Implementation


spu_mfcdma32(0, 0, 0, tag, ((tid<<24) | (rid<<16) | MFC_EIEIO_CMD))


mfc_sync: Enqueue mfc_sync Command into DMA Queue or Stall When Queue is Full


(void) mfc_sync(uint32_t tag)


An mfc_sync command is enqueued into the DMA queue, where tag is the DMA tag, or the command is stalled
when the DMA queue is full. This function must not be used to maintain the order of commands immediately inside
a single SPE. The mfc_sync command is intended for inter-processor and inter-device synchronization. This command
creates a large load on the memory system.


Implementation


spu_mfcdma32(0, 0, 0, tag, MFC_SYNC_CMD)


4.7. MFC DMA Status 


This section describes functions that can be used to check the completion of MFC commands or the status of 
entries in the MFC DMA queue. 


mfc_stat_cmd_queue: Check the Number of Available Entries in the MFC DMA Queue 


(uint32_t) mfc_stat_cmd_queue(void)


The number of available entries in the MFC DMA queue is checked. This information can be used to avoid stalling
the execution of an SPU program if a DMA command is issued to a full queue. The queue has a capacity of 16 entries.


Implementation


spu_readchcnt(MFC_Cmd)


mfc_write_tag_mask: Set Tag Mask to Select MFC Tag Groups to be Included in Query Operation 


(void) mfc_write_tag_mask(uint32_t mask)


A tag mask is set to select the MFC tag groups to be included in the query operation, where mask is the DMA
tag-group query mask. Each bit of mask corresponds to one tag group; tag 0 is mapped to the LSB.


Implementation


spu_writech(MFC_WrTagMask, mask)
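
Because tag 0 is mapped to the LSB, the mask bit for a tag group is a single left shift. The following is a minimal host-side sketch of building a query mask for several tag groups. The helper names tag_bit and tag_mask_of are hypothetical and are not defined by spu_mfcio.h; the range check only reflects what a 32-bit mask can express.

```c
#include <assert.h>
#include <stdint.h>

/* Mask bit for one tag group; tag 0 is the least significant bit. */
static uint32_t tag_bit(uint32_t tag)
{
    assert(tag < 32u);              /* a 32-bit mask has 32 positions */
    return (uint32_t)1 << tag;
}

/* OR together the bits for every tag group in the array. */
static uint32_t tag_mask_of(const uint32_t *tags, int count)
{
    uint32_t mask = 0;
    for (int i = 0; i < count; i++)
        mask |= tag_bit(tags[i]);
    return mask;
}
```

On the SPU, a value built this way would be the argument passed to mfc_write_tag_mask before a tag-status update is requested.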


mfc_read_tag_mask: Read Tag Mask Indicating MFC Tag Groups to be Included in Query Operation 


(uint32_t) mfc_read_tag_mask(void)


The tag mask is read to identify MFC tag groups to be included in the query operation. Each bit of the mask
corresponds to one tag group; tag 0 is mapped to the LSB. The result represents a DMA tag-group query mask.


Implementation


spu_readch(MFC_RdTagMask)


mfc_write_tag_update: Request That Tag Status be Updated


(void) mfc_write_tag_update(uint32_t ts)


A request is sent to the MFC to update tag status, where ts specifies a tag-status update condition shown in
Table 4-105.


This function must precede a tag-status read with the mfc_read_tag_status() function. A tag-status update
request should be performed after setting the tag-group mask with the mfc_write_tag_mask() function.


Table 4-105: MFC Write Tag Update Conditions¹


Number    Mnemonic                    Description
0         MFC_TAG_UPDATE_IMMEDIATE    Update immediately, unconditionally.
1         MFC_TAG_UPDATE_ANY          Update tag status if or when any enabled tag group has
                                      “no outstanding operation” status.
2         MFC_TAG_UPDATE_ALL          Update tag status if or when all enabled tag groups have
                                      “no outstanding operation” status.


¹ Condition enumerants are defined in spu_mfcio.h.


Implementation


spu_writech(MFC_WrTagUpdate, ts)


mfc_write_tag_update_immediate: Request That Tag Status be Immediately Updated 


(void) mfc_write_tag_update_immediate(void)


A request is sent to immediately update tag status.


Implementation


spu_writech(MFC_WrTagUpdate, MFC_TAG_UPDATE_IMMEDIATE)


mfc_write_tag_update_any: Request That Tag Status be Updated for Any Enabled Completion with No 
Outstanding Operation 


(void) mfc_write_tag_update_any(void)


A request is sent to update tag status when any enabled MFC tag-group completion has a “no operation
outstanding” status.


Implementation


spu_writech(MFC_WrTagUpdate, MFC_TAG_UPDATE_ANY)





mfc_write_tag_update_all: Request That Tag Status be Updated When All Enabled Tag Groups Have No 
Outstanding Operation 


(void) mfc_write_tag_update_all(void)


A request is sent to update tag status when all enabled MFC tag groups have a “no operation outstanding” status.


Implementation


spu_writech(MFC_WrTagUpdate, MFC_TAG_UPDATE_ALL)





mfc_stat_tag_update: Check Availability of Tag Status Update Request Channel 


(uint32_t) mfc_stat_tag_update(void)


The availability of the Tag Status Update Request channel is checked. The result has one of the following values:


• 0: The Tag Status Update Request channel is not yet available.
• 1: The Tag Status Update Request channel is available.


Implementation


spu_readchcnt(MFC_WrTagUpdate)


mfc_read_tag_status: Wait for an Updated Tag Status 


(uint32_t) mfc_read_tag_status(void)


The status of the tag groups is requested. Unless the tag update is set to MFC_TAG_UPDATE_IMMEDIATE, this call
can block. Each bit of the returned value indicates the status of one tag group; tag 0 is mapped to the LSB. If a bit is set,
the tag group has no outstanding operation (that is, its commands have completed) and is not masked by the query.


Only the status of the tag groups that are enabled at the time of the tag-group status update is valid. The bit positions that
correspond to the tag groups that are disabled at the time of the tag-group status update are set to 0.


Implementation


spu_readch(MFC_RdTagStat)
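
The returned status is usually compared against the mask that was written earlier. The sketch below shows two such checks in plain C, using only the bit semantics described above (a set bit means "no outstanding operation", and disabled groups read as 0). The function names are hypothetical; on the SPU the status argument would come from mfc_read_tag_status.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if every tag group selected by mask has no outstanding
   operation. An empty mask is rejected because it would trivially
   report completion. */
static int all_tags_done(uint32_t status, uint32_t mask)
{
    assert(mask != 0u);
    return (status & mask) == mask;
}

/* Nonzero if at least one tag group selected by mask has no
   outstanding operation. */
static int any_tag_done(uint32_t status, uint32_t mask)
{
    assert(mask != 0u);
    return (status & mask) != 0u;
}
```

These two checks mirror the MFC_TAG_UPDATE_ALL and MFC_TAG_UPDATE_ANY conditions, applied in software to a status value that was read immediately.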


mfc_read_tag_status_immediate: Wait for the Updated Status of Any Enabled Tag Group 


(uint32_t) mfc_read_tag_status_immediate(void)


A request is sent to immediately update tag status. The processor waits for the status to be updated.


Implementation


spu_mfcstat(MFC_TAG_UPDATE_IMMEDIATE)

















mfc_read_tag_status_any: Wait for No Outstanding Operation of Any Enabled Tag Group 


(uint32_t) mfc_read_tag_status_any(void)


A request is sent to update tag status when any enabled MFC tag-group completion has a “no operation
outstanding” status. The processor waits for the status to be updated.


Implementation


spu_mfcstat(MFC_TAG_UPDATE_ANY)





mfc_read_tag_status_all: Wait for No Outstanding Operation of All Enabled Tag Groups 


(uint32_t) mfc_read_tag_status_all(void)


A request is sent to update tag status when all enabled MFC tag groups have a “no operation outstanding” status.
The processor waits for the status to be updated.


Implementation


spu_mfcstat(MFC_TAG_UPDATE_ALL)





mfc_stat_tag_status: Check Availability of MFC_RdTagStat Channel 


(uint32_t) mfc_stat_tag_status(void)


The availability of the MFC_RdTagStat channel is checked, and one of the following values is returned:


• 0: The status is not yet available.
• 1: The status is available.


This function is used to avoid a channel stall caused by reading the MFC_RdTagStat channel when a status is not
available.


Implementation


spu_readchcnt(MFC_RdTagStat)


mfc_read_list_stall_status: Read List DMA Stall-and-Notify Status


(uint32_t) mfc_read_list_stall_status(void)


The List DMA stall-and-notify status is read and returned, or the program is stalled until the status is available.


Implementation


spu_readch(MFC_RdListStallStat)


mfc_stat_list_stall_status: Check Availability of List DMA Stall-and-Notify Status 


(uint32_t) mfc_stat_list_stall_status(void)


The availability of the List DMA stall-and-notify status is checked, and one of the following values is returned:


• 0: The status is not yet available.
• 1: The status is available.


Implementation


spu_readchcnt(MFC_RdListStallStat)


mfc_write_list_stall_ack: Acknowledge Tag Group Containing Stalled DMA List Commands 


(void) mfc_write_list_stall_ack(uint32_t tag)


An acknowledgement is sent with respect to a prior stall-and-notify event. (See mfc_read_list_stall_status and
mfc_stat_list_stall_status.) The argument tag is the DMA tag.


Implementation


spu_writech(MFC_WrListStallAck, tag)


mfc_read_atomic_status: Read Atomic Command Status 


(uint32_t) mfc_read_atomic_status(void)


The atomic command status is read, or the program is stalled until the status is available. As shown in Table 4-106,
one of the following atomic command status results (binary value of bits 29 through 31) is returned:


Table 4-106: Read Atomic Command Status or Stall Until Status Is Available¹


Status    Mnemonic              Description
1         MFC_PUTLLC_STATUS     0: The mfc_putllc command succeeded.
                                1: The mfc_putllc command failed (reservation lost).
2         MFC_PUTLLUC_STATUS    The mfc_putlluc command was completed.
4         MFC_GETLLAR_STATUS    The mfc_getllar command was completed.


¹ Status enumerants are defined in spu_mfcio.h.


Implementation


spu_readch(MFC_RdAtomicStat)
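
A typical use of the atomic status is to decide whether a conditional put must be retried. The following sketch tests the MFC_PUTLLC_STATUS bit from Table 4-106 in plain C. The function name putllc_failed is hypothetical, and the constants are redefined here only so that the fragment compiles on a host; SPU code would take them from spu_mfcio.h.

```c
#include <stdint.h>

/* Status values copied from Table 4-106. */
#define MFC_PUTLLC_STATUS  1u
#define MFC_PUTLLUC_STATUS 2u
#define MFC_GETLLAR_STATUS 4u

/* Nonzero if the status read after an mfc_putllc reports that the
   reservation was lost, in which case the update did not happen. */
static int putllc_failed(uint32_t atomic_status)
{
    return (atomic_status & MFC_PUTLLC_STATUS) != 0u;
}
```

A caller that sees a nonzero result would normally repeat the whole sequence, starting again from mfc_getllar, because the line may have changed since the reservation was created.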


mfc_stat_atomic_status: Check Availability of Atomic Command Status 


(uint32_t) mfc_stat_atomic_status(void)


The availability of the atomic command status is checked, and one of the following values is returned:


• 0: An atomic DMA command has not yet completed.
• 1: An atomic DMA command has completed and the status is available.


Implementation


spu_readchcnt(MFC_RdAtomicStat)


4.8. MFC Multisource Synchronization Request 


The Cell Broadband Engine™ Architecture describes the MFC Multisource Synchronization Facility. In that 
document, a cumulative ordering is broadly defined as an ordering of storage accesses performed by multiple 
processors or units with respect to another processor or unit. In this section, several functions are described that 
can be used to achieve a cumulative ordering across local and main storage address domains. 


mfc_write_multi_src_sync_request: Request Multisource Synchronization 


(void) mfc_write_multi_src_sync_request(void)


A request is sent to start tracking outstanding transfers sent to the associated MFC. When the requested
synchronization is complete, the channel count of the MFC Multisource Synchronization Request channel is reset to
one.


Implementation


spu_writech(MFC_WrMSSyncReq, 0)


mfc_stat_multi_src_sync_request: Check the Status of Multisource Synchronization 


(uint32_t) mfc_stat_multi_src_sync_request(void)


The channel count of the MFC Multisource Synchronization Request channel is read, and one of the following
values is returned:


• 0: Outstanding transfers are being tracked.
• 1: The synchronization requested by mfc_write_multi_src_sync_request is complete.


Implementation


spu_readchcnt(MFC_WrMSSyncReq)


4.9. SPU Signal Notification 


In this section, functions are described that can be used to read signals from other processors and other devices in 
the system. 


spu_read_signal1: Atomically Read and Clear Signal Notification 1 Channel 


(uint32_t) spu_read_signal1(void)


The Signal Notification 1 channel is read, and any bits that are set are atomically reset. A signal is returned. If no
signals are pending, this function will stall the SPU until a signal is issued.


Implementation


spu_readch(SPU_RdSigNotify1)


spu_stat_signal1: Check if Pending Signals Exist on Signal Notification 1 Channel 


(uint32_t) spu_stat_signal1(void)


A check is made to determine whether any pending signals exist on the Signal Notification 1 channel. One of the
following values is returned:


• 0: No signals are pending.
• 1: Signals are pending.


Implementation


spu_readchcnt(SPU_RdSigNotify1)


spu_read_signal2: Atomically Read and Clear Signal Notification 2 Channel


(uint32_t) spu_read_signal2(void)


The Signal Notification 2 channel is read, and any bits that are set are atomically reset. A signal is returned. If no 
signals are pending, a call of this function stalls the SPU until a signal is issued. 


Implementation 
spu_readch(SPU_RdSigNotify2) 


spu_stat_signal2: Check if Any Pending Signals Exist on Signal Notification 2 Channel 


(uint32_t) spu_stat_signal2(void)


A check is made to determine whether any pending signals exist on the Signal Notification 2 channel. One of the
following values is returned:


• 0: No signals are pending.
• 1: Signals are pending.


Implementation


spu_readchcnt(SPU_RdSigNotify2)


4.10. SPU Mailboxes 


This section describes functions that can be used to manage SPU Mailboxes. 


spu_read_in_mbox: Read Next Data Entry in SPU Inbound Mailbox 


(uint32_t) spu_read_in_mbox(void)


The next data entry in the SPU Inbound Mailbox queue is read. The command stalls when the queue is empty. The
application-specific mailbox data is returned. Each application can uniquely define the mailbox data.


Implementation


spu_readch(SPU_RdInMbox)


spu_stat_in_mbox: Get the Number of Data Entries in SPU Inbound Mailbox 


(uint32_t) spu_stat_in_mbox(void)


The number of data entries in the SPU Inbound Mailbox is returned. If the returned value is non-zero, the mailbox
contains data entries that have not been read by the SPU.


Implementation


spu_readchcnt(SPU_RdInMbox)


spu_write_out_mbox: Send Data to SPU Outbound Mailbox 


(void) spu_write_out_mbox(uint32_t data)


Data is sent to the SPU Outbound Mailbox, where data is application-specific mailbox data, or the command stalls
when the SPU Outbound Mailbox is full.


Implementation


spu_writech(SPU_WrOutMbox, data)


spu_stat_out_mbox: Get Available Capacity of SPU Outbound Mailbox


(uint32_t) spu_stat_out_mbox(void)


The available capacity of the SPU Outbound Mailbox is returned. A value of zero indicates that the mailbox is full.


Implementation


spu_readchcnt(SPU_WrOutMbox)


spu_write_out_intr_mbox: Send Data to SPU Outbound Interrupt Mailbox 


(void) spu_write_out_intr_mbox(uint32_t data)


Data is sent to the SPU Outbound Interrupt Mailbox, where data is application-specific mailbox data. The command
stalls when the SPU Outbound Interrupt Mailbox is full.


Implementation


spu_writech(SPU_WrOutIntrMbox, data)


spu_stat_out_intr_mbox: Get Available Capacity of SPU Outbound Interrupt Mailbox 


(uint32_t) spu_stat_out_intr_mbox(void)


The available capacity of the SPU Outbound Interrupt Mailbox is returned. A value of zero indicates that the mailbox
is full.


Implementation


spu_readchcnt(SPU_WrOutIntrMbox)


4.11. SPU Decrementer 


This section describes functions that use the SPU 32-bit decrementer. 


spu_read_decrementer: Read Current Value of Decrementer 





(uint32_t) spu_read_decrementer(void)


The current value of the decrementer is read and returned. 


Implementation 
spu_readch(SPU_RdDec) 


spu_write_decrementer: Load a Value to Decrementer 


(void) spu_write_decrementer(uint32_t count)


A count is loaded to the decrementer.


Implementation


spu_writech(SPU_WrDec, count)
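
A common use of the two decrementer functions is interval timing: read the decrementer before and after a region of code and subtract. The sketch below shows only the arithmetic, in plain C. The function name dec_elapsed is hypothetical, and the wraparound behavior is an assumption: it treats the decrementer as a 32-bit down-counter that wraps modulo 2^32, which this section does not state explicitly.

```c
#include <stdint.h>

/* Ticks elapsed between two decrementer readings. The counter counts
   down, so the earlier reading (start) is normally the larger value.
   Unsigned 32-bit subtraction also gives the right answer across one
   wrap from 0 to 0xFFFFFFFF (assumed behavior, see the lead-in). */
static uint32_t dec_elapsed(uint32_t start, uint32_t now)
{
    return start - now;
}
```

On the SPU, start and now would both come from spu_read_decrementer. The result is in decrementer ticks; converting it to time requires the decrementer frequency, which is platform-specific and is not covered here.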


4.12. SPU Event


This section describes several functions that can be used to monitor SPU events. See the Cell Broadband Engine™
Architecture for a description of the SPU Event Facility.


The bit-fields of the Event Status, the Event Mask, and the Event Ack are shown in Table 4-107.


Table 4-107: MFC Event Bit-Fields¹


Bits      Field Name                           Description
0x1000    MFC_MULTI_SRC_SYNC_EVENT             Multisource synchronization event
0x0800    MFC_PRIV_ATTN_EVENT                  SPU privileged attention event
0x0400    MFC_LLR_LOST_EVENT                   Lock-line reservation lost event
0x0200    MFC_SIGNAL_NOTIFY_1_EVENT            SPU Signal Notification 1 available event
0x0100    MFC_SIGNAL_NOTIFY_2_EVENT            SPU Signal Notification 2 available event
0x0080    MFC_OUT_MBOX_AVAILABLE_EVENT         SPU Outbound Mailbox available event
0x0040    MFC_OUT_INTR_MBOX_AVAILABLE_EVENT    SPU Outbound Interrupt Mailbox available event
0x0020    MFC_DECREMENTER_EVENT                SPU decrementer event
0x0010    MFC_IN_MBOX_AVAILABLE_EVENT          SPU Inbound Mailbox available event
0x0008    MFC_COMMAND_QUEUE_AVAILABLE_EVENT    MFC SPU command queue available event
0x0002    MFC_LIST_STALL_NOTIFY_EVENT          MFC DMA List command stall-and-notify event
0x0001    MFC_TAG_STATUS_UPDATE_EVENT          MFC tag-group status update event


¹ Bit-field names are defined in spu_mfcio.h.
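
An event status, mask, or acknowledgment value is a bitwise OR of the fields in Table 4-107. As an illustration, the following host-side sketch decodes such a value using a lookup table built from the bit values in that table. The structure and function names are hypothetical and are not part of spu_mfcio.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit values and field names copied from Table 4-107. */
struct event_field {
    uint32_t    bit;
    const char *name;
};

static const struct event_field event_fields[] = {
    { 0x1000u, "MFC_MULTI_SRC_SYNC_EVENT" },
    { 0x0800u, "MFC_PRIV_ATTN_EVENT" },
    { 0x0400u, "MFC_LLR_LOST_EVENT" },
    { 0x0200u, "MFC_SIGNAL_NOTIFY_1_EVENT" },
    { 0x0100u, "MFC_SIGNAL_NOTIFY_2_EVENT" },
    { 0x0080u, "MFC_OUT_MBOX_AVAILABLE_EVENT" },
    { 0x0040u, "MFC_OUT_INTR_MBOX_AVAILABLE_EVENT" },
    { 0x0020u, "MFC_DECREMENTER_EVENT" },
    { 0x0010u, "MFC_IN_MBOX_AVAILABLE_EVENT" },
    { 0x0008u, "MFC_COMMAND_QUEUE_AVAILABLE_EVENT" },
    { 0x0002u, "MFC_LIST_STALL_NOTIFY_EVENT" },
    { 0x0001u, "MFC_TAG_STATUS_UPDATE_EVENT" },
};

#define EVENT_FIELD_COUNT (sizeof event_fields / sizeof event_fields[0])

/* Field name for a single event bit, or NULL if the bit is not listed. */
static const char *event_field_name(uint32_t bit)
{
    for (size_t i = 0; i < EVENT_FIELD_COUNT; i++)
        if (event_fields[i].bit == bit)
            return event_fields[i].name;
    return NULL;
}

/* Number of listed events that are set in an event status value. */
static int count_pending_events(uint32_t status)
{
    int n = 0;
    for (size_t i = 0; i < EVENT_FIELD_COUNT; i++)
        if (status & event_fields[i].bit)
            n++;
    return n;
}
```

For example, a status of 0x0021 has two listed events set: the decrementer event and the tag-group status update event. Note that 0x0004 does not appear in the table, so event_field_name returns NULL for it.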


spu_read_event_status: Read Event Status or Stall Until Status is Available 


(uint32_t) spu_read_event_status(void)





The event status is read and returned. The command stalls until the status is available. Events that have been 
reported but not acknowledged will continue to be reported until acknowledged. 


The return value is the value of the SPU Read Event Status channel. 


Implementation 


spu_readch(SPU_RdEventStat) 





spu_stat_event_status: Check Availability of Event Status 


(uint32_t) spu_stat_event_status(void)


The event status is checked, and one of the following values is returned:


• 0: No enabled events occurred.
• 1: Enabled events are pending.


Implementation


spu_readchcnt(SPU_RdEventStat)


spu_write_event_mask: Select Events to be Monitored by Event Status 


(void) spu_write_event_mask(uint32_t mask)


Events are selected to be monitored by event status. The argument, mask, is the event mask.


Implementation


spu_writech(SPU_WrEventMask, mask)


spu_write_event_ack: Acknowledge Events


(void) spu_write_event_ack(uint32_t ack)


This function acknowledges that the corresponding events are being serviced by the software. The status of
acknowledged events is reset, and the events are resampled. The argument, ack, specifies the events to
acknowledge.


Implementation


spu_writech(SPU_WrEventAck, ack)


spu_read_event_mask: Read Event Status Mask 


(uint32_t) spu_read_event_mask(void)


The current Event Status Mask is read, and the mask is returned.


Implementation


spu_readch(SPU_RdEventMask)





4.13. SPU State Management


This section describes functions that relate to interrupts. See the Cell Broadband Engine™ Architecture for a
description of the SPU Machine Status channel and the SPU interrupt-related channels.


spu_read_machine_status: Read Current SPU Machine Status


(uint32_t) spu_read_machine_status(void)


The current SPU machine status is read, and the status is returned.


Implementation


spu_readch(SPU_RdMachStat)


spu_write_srr0: Write to SPU SRR0


(void) spu_write_srr0(uint32_t srr0)


The value of srr0 is written to the SPU state save/restore register 0 (SRR0).


Implementation


spu_writech(SPU_WrSRR0, srr0)


spu_read_srr0: Read SPU SRR0


(uint32_t) spu_read_srr0(void)


The SPU state save/restore register 0 (SRR0) is read, and the state is returned.


Implementation


spu_readch(SPU_RdSRR0)


5. SPU and Vector Multimedia Extension Intrinsics


Function mapping techniques can be used to increase the portability of source code written with SPU intrinsics. One 
important set of intrinsic function mappings is between the SPU and PPU. This chapter describes a minimal 
mapping between SPU intrinsics and PPU Vector Multimedia Extension intrinsics. 


For many intrinsic functions, an efficient one-to-one mapping between architectures will exist. For some functions, 
there could be a less efficient one-to-many instruction mapping; and for other functions, no straightforward mapping 
will exist because a mapping is either impractical or impossible to implement. In this document, only one-to-one 
mappings are identified for the SPU and PPU. For those SPU and PPU intrinsic functions for which there is no 
straightforward mapping, an explanation of the difficulty in mapping is provided. 


The mappings between SPU and PPU intrinsics are defined in two header files: vmx2spu.h and spu2vmx.h. The 
former maps Vector Multimedia Extension intrinsics to generic SPU intrinsics, and the latter maps generic SPU 
intrinsics to Vector Multimedia Extension intrinsics. The functions that are defined in these two header files can be 
implemented as overloaded inline functions. To facilitate implementation, the vector data types must also be 
mapped. 


The header file vec_types.h is provided to declare the single token vector data types for the Vector Multimedia 
Extension vector data types and to perform type mappings between the SPU and Vector Multimedia Extension. 
Programmers must similarly declare vector data using these single token data types. The single token vector data 
types for the Vector Multimedia Extension intrinsics are shown in Table 5-108. 


Table 5-108: Vector Multimedia Extension Single Token Vector Data Types 





Vector Keyword Data Type    Single Token Typedef
vector unsigned char        vec_uchar16
vector signed char          vec_char16
vector bool char            vec_bchar16
vector unsigned short       vec_ushort8
vector signed short         vec_short8
vector bool short           vec_bshort8
vector unsigned int         vec_uint4
vector signed int           vec_int4
vector bool int             vec_bint4
vector float                vec_float4
vector pixel                vec_pixel8





5.1. Mapping of Vector Multimedia Extension Intrinsics to SPU Intrinsics 


This section lists the one-to-one mapping of Vector Multimedia Extension intrinsics to SPU intrinsics. It also lists 
those Vector Multimedia Extension intrinsics that are difficult to map to SPU intrinsics. 


5.1.1. One-to-One Mapped Intrinsics 


The Vector Multimedia Extension intrinsics that map one-to-one with the generic SPU intrinsics are shown in 
Table 5-109. 


Table 5-109: Vector Multimedia Extension Intrinsics That Map One-to-One with SPU Intrinsics 





Generic Vector Multimedia
Extension Intrinsic          Maps to SPU Intrinsic    Applicable Data Type(s)
vec_add                      spu_add                  halfword, word, and float (not byte)
vec_addc                     spu_genc                 All
vec_and                      spu_and                  All
vec_andc                     spu_andc                 All
vec_avg                      spu_avg                  unsigned char
vec_cmpeq                    spu_cmpeq                All
vec_cmpgt                    spu_cmpgt                All
vec_cmplt                    spu_cmpgt                All (requires parameter reordering)
vec_ctf                      spu_convtf               All
vec_cts                      spu_convts               All
vec_ctu                      spu_convtu               All
vec_madd                     spu_madd                 All
vec_mule                     spu_mule                 halfword (not byte)
vec_mulo                     spu_mulo                 halfword (not byte)
vec_nmsub                    spu_nmsub                All
vec_nor                      spu_nor                  All
vec_or                       spu_or                   All
vec_re                       spu_re                   All
vec_rl                       spu_rl                   halfword, word (not byte)
vec_rsqrte                   spu_rsqrte               All
vec_sel                      spu_sel                  All
vec_sub                      spu_sub                  halfword, word, float
vec_subc                     spu_genb                 All
vec_xor                      spu_xor                  All
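
The vec_cmplt row is the only entry in Table 5-109 that notes parameter reordering. The reason is that "a less than b" is the same comparison as "b greater than a", so a less-than intrinsic can be expressed with a greater-than intrinsic by swapping the operands. The following scalar model illustrates the idea in plain C. The types int4_model and mask4_model and both functions are hypothetical models written for this illustration; they are not the contents of vmx2spu.h and do not use real vector types.

```c
#include <stdint.h>

/* Hypothetical four-lane models of a signed-int vector and of the
   all-ones / all-zeros mask that a vector compare produces. */
typedef struct { int32_t  lane[4]; } int4_model;
typedef struct { uint32_t lane[4]; } mask4_model;

/* Model of a lane-wise greater-than compare. */
static mask4_model model_cmpgt(int4_model a, int4_model b)
{
    mask4_model m;
    for (int i = 0; i < 4; i++)
        m.lane[i] = (a.lane[i] > b.lane[i]) ? 0xFFFFFFFFu : 0u;
    return m;
}

/* Less-than expressed through greater-than with the operands swapped. */
static mask4_model model_cmplt(int4_model a, int4_model b)
{
    return model_cmpgt(b, a);
}
```

Lanes that compare equal produce zero in both directions, so the swapped form gives exactly the less-than result and not less-than-or-equal.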





5.1.2. Vector Multimedia Extension Intrinsics That Are Difficult to Map to SPU Intrinsics 


The Vector Multimedia Extension intrinsics that are shown in Table 5-110 are not likely to be mapped to generic 
SPU intrinsics because a straightforward mapping does not exist. 


Table 5-110: Vector Multimedia Extension Intrinsics That Are Difficult to Map to SPU Intrinsics 





Generic Vector Multimedia
Extension Intrinsic(s)       Explanation
vec_unpackh, vec_unpackl     These functions cannot be mapped without creating additional SPU data types. A
                             mapping of pixel and bool short vector types to an unsigned short (as
                             described in Table 1-2) will cause an overloaded function selection conflict.
vec_mfvscr, vec_mtvscr       Support of the VSCR register is difficult because the SPU does not support IEEE
                             rounding modes on single-precision floating-point operations.
vec_step                     Mapping requires specific compiler support that is not mandated by this
                             specification.





5.2. Mapping of SPU Intrinsics to Vector Multimedia Extension Intrinsics 


This section lists the one-to-one mapping of SPU intrinsics to Vector Multimedia Extension intrinsics. It also lists 
those SPU intrinsics that are difficult to map to Vector Multimedia Extension intrinsics. 


5.2.1. One-to-One Mapped Intrinsics 


Many of the generic SPU intrinsics map one-to-one with Vector Multimedia Extension intrinsics. These mappings
are shown in Table 5-111.


Table 5-111: SPU Intrinsics That Map One-to-One with Vector Multimedia Extension Intrinsics


                         Maps to Vector Multimedia
Generic SPU Intrinsic    Extension Intrinsic          Applicable Data Type(s)
spu_add                  vec_add                      vector/vector (no scalar operands)
spu_and                  vec_and                      vector/vector (no scalar operands)
spu_andc                 vec_andc                     All
spu_avg                  vec_avg                      All
spu_cmpeq                vec_cmpeq                    vector/vector (no scalar operands)
spu_cmpgt                vec_cmpgt                    vector/vector (no scalar operands)
spu_convtf               vec_ctf                      Limited scale range (5 bits)
spu_convts               vec_cts                      Limited scale range (5 bits)
spu_convtu               vec_ctu                      Limited scale range (5 bits)
spu_genb                 vec_subc                     All
spu_genc                 vec_addc                     All
spu_madd                 vec_madd                     float
spu_mule                 vec_mule                     All
spu_mulo                 vec_mulo                     Halfword vector/vector (no scalar operands)
spu_nmsub                vec_nmsub                    float
spu_nor                  vec_nor                      All
spu_or                   vec_or                       vector/vector (no scalar operands)
spu_re                   vec_re                       All
spu_rl                   vec_rl                       vector/vector (no scalar operands)
spu_rsqrte               vec_rsqrte                   All
spu_sel                  vec_sel                      All
spu_sub                  vec_sub                      vector/vector (no scalar operands)
spu_xor                  vec_xor                      vector/vector (no scalar operands)





5.2.2. SPU Intrinsics That Are Difficult to Map to Vector Multimedia Extension Intrinsics 


The generic SPU intrinsics that are shown in Table 5-112 are not likely to be mapped to Vector Multimedia 
Extension intrinsics because a straightforward mapping does not exist. 


Table 5-112: SPU Intrinsics That Are Difficult to Map to Vector Multimedia Extension Intrinsics 





Generic SPU Intrinsic(s)                    Explanation

spu_bisled, spu_bislede, spu_bisledi        Event handling and interrupt handling on the SPU cannot be
spu_idisable, spu_ienable                   precisely mapped.

spu_readch, spu_readchqw, spu_readchcnt     Specific channel functionality cannot be easily supported on the PPU,
spu_writech, spu_writechqw                  nor would it generally be desirable to do so. Whereas some channel
                                            sequences could be mapped, most would require special
                                            programmer insight and direction.

spu_mfcdma32, spu_mfcdma64, spu_mfcstat     The mapping of DMA transactions typically is not needed because
                                            the PPU has full memory access. Nevertheless, these intrinsics could
                                            be used to perform memory synchronization that might not be
                                            precisely mappable.

spu_sync, spu_sync_c                        These intrinsics could be mapped to one of the PPU sync
spu_dsync                                   instructions, but the results might not be what was intended.

spu_convts, spu_convtu, spu_convtf          The full dynamic range of scale factors is not easily supported.
                                            Vector Multimedia Extension provides a 5-bit scale factor; the SPU
                                            has an 8-bit scale factor. Some implementations might support only
                                            the 5-bit range provided by the direct mapping of the equivalent
                                            intrinsics.

spu_hcmpeq, spu_hcmpgt                      The halt instruction might be mappable to an exit function, but this
                                            will not work in all environments.

spu_stop, spu_stopd                         It is not always appropriate to stop execution of the PPU.
6. PPU VMX Intrinsics


This chapter describes intrinsics that make the underlying PPU VMX instruction set accessible from the C and
C++ programming languages. The AltiVec Technology Programming Interface Manual, Section 4.4, defines most of
the generic intrinsics for the PPU VMX instruction set, except for a few new instructions that are specified in this
chapter. The new intrinsics are in two different categories: intrinsics for extracting vector elements and intrinsics for
inserting vector elements.


The PPU VMX intrinsics are declared in the system header file altivec.h. They may be either defined as
macros within this header or implemented internally within the compiler.


For data prefetches, the __dcbt, __dcbtst, __dcbt_TH1000, and __dcbt_TH1010 intrinsics should be used. 


The related stream control operations that are defined in the AltiVec Technology Programming Interface Manual, 
which are listed below, have been deprecated on the PPU and will execute as a NOP. 


Table 6-113: Stream Control Operators That Have Been Deprecated on the PPU

Stream Control Operator    Description

vec_dss(a)                 Vector Data Stream Stop
vec_dssall()               Vector Stream Stop All
vec_dst(a,b,c)             Vector Stream Touch
vec_dstst(a,b,c)           Vector Data Stream Touch for Store Transient
vec_extract: Extract Vector Element From Vector


d = vec_extract(a, element)


The element that is specified by element is extracted from vector a and returned in scalar d. Depending on the
size of the element, only a limited number of the least significant bits of the element index are used. Specifically, for
1-, 2-, and 4-byte elements, only four, three, and two of the least significant bits are used, respectively.


Table 6-114: Extract Vector Element From Vector

Return/Argument Types                                  Assembly Mapping¹
d                 a                        element

unsigned char     vector unsigned char     int         EA = memaddr + (element & 0xF)
                                                       stvebx a, 0, EA
                                                       lbzx d, 0, EA

signed char       vector signed char       int         EA = memaddr + (element & 0xF)
                                                       stvebx a, 0, EA
                                                       lbzx d, 0, EA
                                                       extsb d, d

unsigned short    vector unsigned short    int         EA = memaddr + ((element & 0x7) << 1)
                                                       stvehx a, 0, EA
                                                       lhzx d, 0, EA

signed short      vector signed short      int         EA = memaddr + ((element & 0x7) << 1)
                                                       stvehx a, 0, EA
                                                       lhzx d, 0, EA
                                                       extsh d, d

unsigned int      vector unsigned int      int         EA = memaddr + ((element & 0x3) << 2)
                                                       stvewx a, 0, EA
                                                       lwzx d, 0, EA

signed int        vector signed int        int         EA = memaddr + ((element & 0x3) << 2)
                                                       stvewx a, 0, EA
                                                       lwzx d, 0, EA
                                                       extsw d, d ²

float             vector float             int         EA = memaddr + ((element & 0x3) << 2)
                                                       stvewx a, 0, EA
                                                       lfsx d, 0, EA

1 memaddr is the address of a temporary memory location which is 16-byte aligned.

2 The sign extend from word to doubleword can be omitted if the processor is running in 32-bit mode.
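
The index masking described for vec_extract can be illustrated with a small portable C model. This is only a sketch under stated assumptions: the vector is modeled as a plain array of elements, and the model_extract_* names are hypothetical helpers, not part of this specification or of altivec.h.

```c
#include <stdint.h>

/* Hypothetical portable models of vec_extract. Only the low bits of the
 * element index are used: four bits for 1-byte elements, three bits for
 * 2-byte elements, and two bits for 4-byte elements. */
static uint8_t model_extract_u8(const uint8_t a[16], int element)
{
    return a[element & 0xF];
}

static uint16_t model_extract_u16(const uint16_t a[8], int element)
{
    return a[element & 0x7];
}

static uint32_t model_extract_u32(const uint32_t a[4], int element)
{
    return a[element & 0x3];
}
```

Under this model an out-of-range index wraps around rather than faulting; for example, element 6 of a four-word vector selects word 2.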
vec_insert: Insert Scalar into Specified Vector Element 
d = vec_insert(a, b, element) 


Scalar a is inserted into the element of vector b that is specified by the element parameter, and the modified 
vector is returned. All other elements of b are unmodified. Depending on the size of the element, only a limited 
number of the least significant bits of the element index are used. Specifically for 1-, 2-, and 4-byte elements, only 
four, three, and two of the least significant bits are used, respectively. 


Table 6-115: Insert Scalar into Specified Vector Element

Return/Argument Types                                                             Assembly Mapping¹
d                        a                 b                        element

vector unsigned char     unsigned char     vector unsigned char     int           EA = memaddr + (element & 0xF)
vector signed char       signed char       vector signed char                     stbx a, 0, EA
                                                                                  lvebx d, 0, EA
                                                                                  vperm d, d, a, pattern

vector unsigned short    unsigned short    vector unsigned short    int           EA = memaddr + ((element & 0x7) << 1)
vector signed short      signed short      vector signed short                    sthx a, 0, EA
                                                                                  lvehx d, 0, EA
                                                                                  vperm d, d, a, pattern

vector unsigned int      unsigned int      vector unsigned int      int           EA = memaddr + ((element & 0x3) << 2)
vector signed int        signed int        vector signed int                      stwx a, 0, EA
                                                                                  lvewx d, 0, EA
                                                                                  vperm d, d, a, pattern

vector float             float             vector float             int           EA = memaddr + ((element & 0x3) << 2)
                                                                                  stfsx a, 0, EA
                                                                                  lvewx d, 0, EA
                                                                                  vperm d, d, a, pattern

1 memaddr is the address of a temporary memory location which is 16-byte aligned.
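
As a rough illustration of the vec_insert semantics, the following portable C sketch models the halfword case. It assumes the vector is represented as an array of eight 16-bit elements; model_insert_u16 is a hypothetical name used only for this example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of vec_insert for 2-byte elements: d receives a copy
 * of b with one element replaced by the scalar a. Only the three least
 * significant bits of the element index are used. */
static void model_insert_u16(uint16_t d[8], uint16_t a,
                             const uint16_t b[8], int element)
{
    memcpy(d, b, 8 * sizeof(uint16_t)); /* all other elements are unmodified */
    d[element & 0x7] = a;
}
```

The input vector b is left untouched; only the returned vector d differs, and only in the selected element.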
vec_lvlx: Load Vector Left Indexed


d = vec_lvlx(a, b)


Let EA be the effective address formed from the sum of the contents of a and the contents of b and let eb be the
value of the four least significant bits of EA. The (16 - eb) bytes addressed by EA are loaded into the leftmost (16 -
eb) byte elements of d and the rightmost eb bytes of d are set to zero.











Table 6-116: Load Vector Left Indexed

Return/Argument Types                                                                     Assembly Mapping
d                        a                    b

vector unsigned char     any integral type    unsigned char *, vector unsigned char *     lvlx d, a, b
vector signed char       any integral type    signed char *, vector signed char *         lvlx d, a, b
vector bool char         any integral type    vector bool char *                          lvlx d, a, b
vector unsigned short    any integral type    unsigned short *, vector unsigned short *   lvlx d, a, b
vector signed short      any integral type    signed short *, vector signed short *       lvlx d, a, b
vector bool short        any integral type    vector bool short *                         lvlx d, a, b
vector pixel             any integral type    vector pixel *                              lvlx d, a, b
vector unsigned int      any integral type    unsigned int *, vector unsigned int *       lvlx d, a, b
vector signed int        any integral type    signed int *, vector signed int *           lvlx d, a, b
vector bool int          any integral type    vector bool int *                           lvlx d, a, b
vector float             any integral type    float *, vector float *                     lvlx d, a, b
vec_lvlxl: Load Vector Left Indexed Last


d = vec_lvlxl(a, b)


Let EA be the effective address formed from the sum of the contents of a and the contents of b and let eb be the
value of the four least significant bits of EA. The (16 - eb) bytes addressed by EA are loaded into the leftmost (16 -
eb) bytes of d and the rightmost eb bytes of d are set to zero. vec_lvlxl provides a hint that the quadword in
memory addressed by EA will probably not be needed again by the program in the near future.





Table 6-117: Load Vector Left Indexed Last

Return/Argument Types                                                                     Assembly Mapping
d                        a                    b

vector unsigned char     any integral type    unsigned char *, vector unsigned char *     lvlxl d, a, b
vector signed char       any integral type    signed char *, vector signed char *         lvlxl d, a, b
vector bool char         any integral type    vector bool char *                          lvlxl d, a, b
vector unsigned short    any integral type    unsigned short *, vector unsigned short *   lvlxl d, a, b
vector signed short      any integral type    signed short *, vector signed short *       lvlxl d, a, b
vector bool short        any integral type    vector bool short *                         lvlxl d, a, b
vector pixel             any integral type    vector pixel *                              lvlxl d, a, b
vector unsigned int      any integral type    unsigned int *, vector unsigned int *       lvlxl d, a, b
vector signed int        any integral type    signed int *, vector signed int *           lvlxl d, a, b
vector bool int          any integral type    vector bool int *                           lvlxl d, a, b
vector float             any integral type    float *, vector float *                     lvlxl d, a, b
vec_lvrx: Load Vector Right Indexed


d = vec_lvrx(a, b)


Let EA be the effective address formed from the sum of the contents of a and the contents of b and let eb be the
value of the four least significant bits of EA. If eb is not equal to zero (that is, EA is not quadword-aligned), then
eb bytes in memory addressed by (EA - eb) are loaded into the rightmost eb bytes of d and the leftmost (16 - eb)
bytes of d are set to zero. If eb is equal to zero (that is, EA is quadword-aligned), then the contents of d are
set to zero.


Table 6-118: Load Vector Right Indexed





Return/Argument Types                                                                     Assembly Mapping
d                        a                    b

vector unsigned char     any integral type    unsigned char *, vector unsigned char *     lvrx d, a, b
vector signed char       any integral type    signed char *, vector signed char *         lvrx d, a, b
vector bool char         any integral type    vector bool char *                          lvrx d, a, b
vector unsigned short    any integral type    unsigned short *, vector unsigned short *   lvrx d, a, b
vector signed short      any integral type    signed short *, vector signed short *       lvrx d, a, b
vector bool short        any integral type    vector bool short *                         lvrx d, a, b
vector pixel             any integral type    vector pixel *                              lvrx d, a, b
vector unsigned int      any integral type    unsigned int *, vector unsigned int *       lvrx d, a, b
vector signed int        any integral type    signed int *, vector signed int *           lvrx d, a, b
vector bool int          any integral type    vector bool int *                           lvrx d, a, b
vector float             any integral type    float *, vector float *                     lvrx d, a, b
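
The left and right indexed loads are complementary, which the following portable C sketch tries to make concrete. It is a model under stated assumptions, not an implementation of the intrinsics: memory is a byte array whose index stands in for the effective address (index 0 is assumed to be quadword-aligned), and the model_* function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of vec_lvlx: the (16 - eb) bytes starting at ea fill the leftmost
 * bytes of d; the rightmost eb bytes are zero. */
static void model_lvlx(uint8_t d[16], const uint8_t *mem, size_t ea)
{
    size_t eb = ea & 0xF;
    memset(d, 0, 16);
    memcpy(d, mem + ea, 16 - eb);
}

/* Model of vec_lvrx: the eb bytes starting at (ea - eb) fill the rightmost
 * bytes of d; the leftmost (16 - eb) bytes are zero. When eb is zero, d is
 * entirely zero. */
static void model_lvrx(uint8_t d[16], const uint8_t *mem, size_t ea)
{
    size_t eb = ea & 0xF;
    memset(d, 0, 16);
    memcpy(d + (16 - eb), mem + (ea - eb), eb);
}

/* A misaligned 16-byte load composed from the two halves: the left part
 * comes from the quadword containing ea, the right part from the next one. */
static void model_load_unaligned(uint8_t d[16], const uint8_t *mem, size_t ea)
{
    uint8_t left[16], right[16];
    model_lvlx(left, mem, ea);
    model_lvrx(right, mem, ea + 16);
    for (int i = 0; i < 16; i++)
        d[i] = left[i] | right[i];
}
```

With the real intrinsics, the same composition is commonly written along the lines of vec_or(vec_lvlx(0, p), vec_lvrx(16, p)); that spelling is given here as a usage hint rather than as something this specification mandates.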
vec_lvrxl: Load Vector Right Indexed Last


d = vec_lvrxl(a, b)


Let EA be the effective address formed from the sum of the contents of a and the contents of b and let eb be the
value of the four least significant bits of EA. If eb is not equal to zero (that is, EA is not quadword-aligned), then
eb bytes in memory addressed by (EA - eb) are loaded into the rightmost eb bytes of d and the leftmost (16 - eb)
bytes of d are set to zero. If eb is equal to zero (that is, EA is quadword-aligned), then the contents of d are
set to zero. vec_lvrxl provides a hint that the quadword in memory addressed by EA will probably not be needed
again by the program in the near future.




















Table 6-119: Load Vector Right Indexed Last

Return/Argument Types                                                                     Assembly Mapping
d                        a                    b

vector unsigned char     any integral type    unsigned char *, vector unsigned char *     lvrxl d, a, b
vector signed char       any integral type    signed char *, vector signed char *         lvrxl d, a, b
vector bool char         any integral type    vector bool char *                          lvrxl d, a, b
vector unsigned short    any integral type    unsigned short *, vector unsigned short *   lvrxl d, a, b
vector signed short      any integral type    signed short *, vector signed short *       lvrxl d, a, b
vector bool short        any integral type    vector bool short *                         lvrxl d, a, b
vector pixel             any integral type    vector pixel *                              lvrxl d, a, b
vector unsigned int      any integral type    unsigned int *, vector unsigned int *       lvrxl d, a, b
vector signed int        any integral type    signed int *, vector signed int *           lvrxl d, a, b
vector bool int          any integral type    vector bool int *                           lvrxl d, a, b
vector float             any integral type    float *, vector float *                     lvrxl d, a, b
vec_stvlx: Store Vector Left Indexed


(void) vec_stvlx(a, b, c)


Let EA be the effective address formed from the sum of the contents of b and the contents of c, and let eb be the
value of the four least significant bits of EA. Store the (16 - eb) leftmost bytes of a into the memory addressed by EA.











Table 6-120: Store Vector Left Indexed

Return/Argument Types                                                                     Assembly Mapping
a                        b                    c

vector unsigned char     any integral type    unsigned char *, vector unsigned char *     stvlx a, b, c
vector signed char       any integral type    signed char *, vector signed char *         stvlx a, b, c
vector bool char         any integral type    vector bool char *                          stvlx a, b, c
vector unsigned short    any integral type    unsigned short *, vector unsigned short *   stvlx a, b, c
vector signed short      any integral type    signed short *, vector signed short *       stvlx a, b, c
vector bool short        any integral type    vector bool short *                         stvlx a, b, c
vector pixel             any integral type    vector pixel *                              stvlx a, b, c
vector unsigned int      any integral type    unsigned int *, vector unsigned int *       stvlx a, b, c
vector signed int        any integral type    signed int *, vector signed int *           stvlx a, b, c
vector bool int          any integral type    vector bool int *                           stvlx a, b, c
vector float             any integral type    float *, vector float *                     stvlx a, b, c
vec_stvlxl: Store Vector Left Indexed Last


(void) vec_stvlxl(a, b, c)


Let EA be the effective address formed from the sum of the contents of b and the contents of c, and let eb be the
value of the four least significant bits of EA. Store the (16 - eb) leftmost bytes of a into the memory addressed by
EA. vec_stvlxl provides a hint that the quadword in memory addressed by EA will probably not be needed again
by the program in the near future.


Table 6-121: Store Vector Left Indexed Last








Return/Argument Types                                                                     Assembly Mapping
a                        b                    c

vector unsigned char     any integral type    unsigned char *, vector unsigned char *     stvlxl a, b, c
vector signed char       any integral type    signed char *, vector signed char *         stvlxl a, b, c
vector bool char         any integral type    vector bool char *                          stvlxl a, b, c
vector unsigned short    any integral type    unsigned short *, vector unsigned short *   stvlxl a, b, c
vector signed short      any integral type    signed short *, vector signed short *       stvlxl a, b, c
vector bool short        any integral type    vector bool short *                         stvlxl a, b, c
vector pixel             any integral type    vector pixel *                              stvlxl a, b, c
vector unsigned int      any integral type    unsigned int *, vector unsigned int *       stvlxl a, b, c
vector signed int        any integral type    signed int *, vector signed int *           stvlxl a, b, c
vector bool int          any integral type    vector bool int *                           stvlxl a, b, c
vector float             any integral type    float *, vector float *                     stvlxl a, b, c
vec_stvrx: Store Vector Right Indexed


(void) vec_stvrx(a, b, c)


Let EA be the effective address formed from the sum of the contents of b and the contents of c, and let eb be the
value of the four least significant bits of EA. Store the eb rightmost bytes of a into the memory addressed by (EA -
eb). If eb is zero, EA is 16-byte aligned and no memory is stored.














Table 6-122: Store Vector Right Indexed

Return/Argument Types                                                                     Assembly Mapping
a                        b                    c

vector unsigned char     any integral type    unsigned char *, vector unsigned char *     stvrx a, b, c
vector signed char       any integral type    signed char *, vector signed char *         stvrx a, b, c
vector bool char         any integral type    vector bool char *                          stvrx a, b, c
vector unsigned short    any integral type    unsigned short *, vector unsigned short *   stvrx a, b, c
vector signed short      any integral type    signed short *, vector signed short *       stvrx a, b, c
vector bool short        any integral type    vector bool short *                         stvrx a, b, c
vector pixel             any integral type    vector pixel *                              stvrx a, b, c
vector unsigned int      any integral type    unsigned int *, vector unsigned int *       stvrx a, b, c
vector signed int        any integral type    signed int *, vector signed int *           stvrx a, b, c
vector bool int          any integral type    vector bool int *                           stvrx a, b, c
vector float             any integral type    float *, vector float *                     stvrx a, b, c
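
The store forms mirror the loads. The sketch below models vec_stvlx and vec_stvrx in portable C to show how together they cover a misaligned 16-byte store. As before, this is an illustration only: memory is a byte array indexed by effective address (index 0 assumed quadword-aligned), and the model_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of vec_stvlx: store the (16 - eb) leftmost bytes of a at ea. */
static void model_stvlx(uint8_t *mem, size_t ea, const uint8_t a[16])
{
    size_t eb = ea & 0xF;
    memcpy(mem + ea, a, 16 - eb);
}

/* Model of vec_stvrx: store the eb rightmost bytes of a at (ea - eb).
 * When eb is zero, nothing is stored. */
static void model_stvrx(uint8_t *mem, size_t ea, const uint8_t a[16])
{
    size_t eb = ea & 0xF;
    memcpy(mem + (ea - eb), a + (16 - eb), eb);
}

/* A misaligned 16-byte store: the left part lands in the quadword that
 * contains ea, the right part at the start of the next quadword. */
static void model_store_unaligned(uint8_t *mem, size_t ea, const uint8_t a[16])
{
    model_stvlx(mem, ea, a);
    model_stvrx(mem, ea + 16, a);
}
```

Bytes outside the 16-byte window starting at ea are not modified, which is the property that makes the pair usable without a read-modify-write sequence.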
vec_stvrxl: Store Vector Right Indexed Last


(void) vec_stvrxl(a, b, c)


Let EA be the effective address formed from the sum of the contents of b and the contents of c, and let eb be the
value of the four least significant bits of EA. Store the eb rightmost bytes of a into the memory addressed by (EA -
eb). If eb is zero, EA is 16-byte aligned and no memory is stored. vec_stvrxl provides a hint that the quadword in
memory addressed by EA will probably not be needed again by the program in the near future.


Table 6-123: Store Vector Right Indexed Last











Return/Argument Types                                                                     Assembly Mapping
a                        b                    c

vector unsigned char     any integral type    unsigned char *, vector unsigned char *     stvrxl a, b, c
vector signed char       any integral type    signed char *, vector signed char *         stvrxl a, b, c
vector bool char         any integral type    vector bool char *                          stvrxl a, b, c
vector unsigned short    any integral type    unsigned short *, vector unsigned short *   stvrxl a, b, c
vector signed short      any integral type    signed short *, vector signed short *       stvrxl a, b, c
vector bool short        any integral type    vector bool short *                         stvrxl a, b, c
vector pixel             any integral type    vector pixel *                              stvrxl a, b, c
vector unsigned int      any integral type    unsigned int *, vector unsigned int *       stvrxl a, b, c
vector signed int        any integral type    signed int *, vector signed int *           stvrxl a, b, c
vector bool int          any integral type    vector bool int *                           stvrxl a, b, c
vector float             any integral type    float *, vector float *                     stvrxl a, b, c







vec_promote: Promote Scalar to a Vector 


d = vec_promote(a, element) 


Scalar a is promoted to a vector containing a in the element that is specified by the element parameter, and the 
result is returned in vector d. All other elements of d are undefined. Depending on the size of a, only a limited 
number of the least significant bits of the element index are used. Specifically for 1-, 2-, and 4-byte elements, only 
four, three, and two of the least significant bits are used, respectively. 


Table 6-124: Promote Scalar to a Vector

Return/Argument Types                                    Assembly Mapping¹
d                        a                 element

vector unsigned char     unsigned char     int           EA = memaddr + (element & 0xF)
vector signed char       signed char                     stbx a, 0, EA
                                                         lvebx d, 0, EA

vector unsigned short    unsigned short    int           EA = memaddr + ((element & 0x7) << 1)
vector signed short      signed short                    sthx a, 0, EA
                                                         lvehx d, 0, EA

vector unsigned int      unsigned int      int           EA = memaddr + ((element & 0x3) << 2)
vector signed int        signed int                      stwx a, 0, EA
                                                         lvewx d, 0, EA

vector float             float             int           EA = memaddr + ((element & 0x3) << 2)
                                                         stfsx a, 0, EA
                                                         lvewx d, 0, EA

1 memaddr is the address of a temporary memory location which is 16-byte aligned.


vec_splats: Splat Scalar to a Vector 


d = vec_splats(a) 


The scalar value a is replicated across all elements of a vector of the same type, and the result is returned in
vector d.


Table 6-125: Splat Scalar to a Vector

Return/Argument Types                                               Assembly Mapping
d                        a

vector unsigned char     unsigned char                              store a into memory (EA) that is 16-byte aligned
vector signed char       signed char                                lvebx/lvehx/lvewx tmp, 0, EA
vector unsigned short    unsigned short                             vspltb/vsplth/vspltw d, tmp, 0
vector signed short      signed short
vector unsigned int      unsigned int
vector signed int        signed int
vector float             float

vector unsigned char     unsigned char (5-bit unsigned literal)     vspltisb d, a
vector signed char       signed char (5-bit unsigned literal)       or
vector unsigned short    unsigned short (5-bit unsigned literal)    vspltish d, a
vector signed short      signed short (5-bit unsigned literal)      or
vector unsigned int      unsigned int (5-bit unsigned literal)      vspltisw d, a
vector signed int        signed int (5-bit unsigned literal)
vector float             float (5-bit unsigned literal)
7. PPU Intrinsics


This chapter specifies a minimal set of specific intrinsics to make the underlying PPU instruction set accessible from
the C programming language. Except for __setflm, each of these intrinsics has a one-to-one assembly language
mapping, unless compiled for a 32-bit ABI in which the high and low halves of a 64-bit doubleword are maintained in
separate registers. In this latter situation, the corresponding 32-bit intrinsic might generate a sequence of
instructions. In other instances, a corresponding 32-bit implementation cannot be supported.


The PPU intrinsics will be declared in the system header file, ppu_intrinsics.h. They may be either defined 
within this header as macros or implemented internally within the compiler. 


Some intrinsics take a literal value of either 3, 4, 5, 6, 8, or 10 bits in length. By default, a call to an intrinsic with an 
out-of-range literal is reported by the compiler as an error. Compilers may provide an option to issue a warning for 
out-of-range literal values and use only the specified number of least significant bits for the out-of-range argument. 
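
The two compiler behaviors described above (reject an out-of-range literal, or warn and keep only the low bits) can be sketched in C as follows. This assumes an unsigned literal field; the helper names literal_in_range and literal_truncate are hypothetical and do not correspond to any compiler option or API named in this specification.

```c
#include <stdint.h>

/* Hypothetical check: nonzero if value fits in an unsigned field of `bits`
 * bits. Intended for the field widths listed above (3, 4, 5, 6, 8, 10). */
static int literal_in_range(uint32_t value, unsigned bits)
{
    return value < (UINT32_C(1) << bits);
}

/* Hypothetical fallback: keep only the `bits` least significant bits, as a
 * compiler might do when it warns instead of reporting an error. */
static uint32_t literal_truncate(uint32_t value, unsigned bits)
{
    return value & ((UINT32_C(1) << bits) - 1u);
}
```

For example, the value 9 does not fit in a 3-bit field; truncation would silently turn it into 1, which is why the default behavior is an error.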


The intrinsics do not have a specific ordering unless otherwise noted. The intrinsics can be optimized by the 
compiler and be scheduled like normal operations. 


__cctph: Change Thread Priority to High 
(void) __cctph() 
The current thread priority is changed to high priority. This intrinsic will not be reordered by the compiler. 


Table 7-126: Change Thread Priority to High 
Return/Argument Types Assembly Mapping 











none cctph 





__cctpl: Change Thread Priority to Low 
(void) __cctpl() 


The current thread priority is changed to low priority. This intrinsic will not be reordered by the compiler. 


Table 7-127: Change Thread Priority to Low 





Return/Argument Types Assembly Mapping 








none cctpl 





__cctpm: Change Thread Priority to Medium 
(void) __cctpm() 


The current thread priority is changed to medium priority. This intrinsic will not be reordered by the compiler. 


Table 7-128: Change Thread Priority to Medium 
Return/Argument Types Assembly Mapping 











none cctpm 
__cntlzd: Count Leading Doubleword Zeros


d = __cntlzd(a)


The number of leading zeros in the doubleword a is returned in d. 


Table 7-129: Count Leading Doubleword Zeros 





Return/Argument Types                    Assembly Mapping
d               a                        64-bit ABI      32-bit ABI

unsigned int    unsigned long long       cntlzd d, a     cntlzw hi_cnt, a_hi
                                                         cntlzw lo_cnt, a_lo
                                                         rlwinm mask, hi_cnt, 26, 0, 5
                                                         srawi mask, mask, 31
                                                         and lo_cnt, lo_cnt, mask
                                                         add d, hi_cnt, lo_cnt
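
The 32-bit ABI sequence can be followed more easily in C. The sketch below mirrors its steps: count the leading zeros of each half, then add the low-half count only when the high half is entirely zero. The names clz32 and model_cntlzd_32bit_abi are hypothetical; clz32 models cntlzw, which returns 32 for a zero input.

```c
#include <stdint.h>

/* Model of cntlzw: number of leading zero bits in a word, 32 for zero. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & UINT32_C(0x80000000)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Model of the 32-bit ABI mapping of __cntlzd. */
static unsigned model_cntlzd_32bit_abi(uint64_t a)
{
    unsigned hi_cnt = clz32((uint32_t)(a >> 32));
    unsigned lo_cnt = clz32((uint32_t)a);
    /* rlwinm + srawi: the mask is all ones only when hi_cnt == 32,
     * that is, when the high word contains no set bits. */
    uint32_t mask = (hi_cnt & 32u) ? UINT32_C(0xFFFFFFFF) : 0u;
    lo_cnt &= mask;
    return hi_cnt + lo_cnt;
}
```

The mask trick avoids a branch: hi_cnt can only have bit 5 set when it equals 32, so that single bit decides whether the low-word count contributes.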





__cntlzw: Count Leading Word Zeros 
d= __cntlzw(a) 


The number of leading zeros in the word a is returned in d. 


Table 7-130: Count Leading Word Zeros 











Return/Argument Types            Assembly Mapping
d               a
unsigned int    unsigned int     cntlzw d, a





__db10cyc: Delay 10 Cycles at Dispatch 
(void) __db10cyc() 
The current thread is blocked at dispatch for 10 cycles. This intrinsic will not be reordered by the compiler. 


Table 7-131: Delay 10 Cycles At Dispatch 
Return/Argument Types Assembly Mapping 











none db10cyc 





__db12cyc: Delay 12 Cycles at Dispatch 
(void) __db12cyc() 


The current thread is blocked at dispatch for 12 cycles. This intrinsic will not be reordered by the compiler. 


Table 7-132: Delay 12 Cycles At Dispatch 





Return/Argument Types Assembly Mapping 








none db12cyc 





__db16cyc: Delay 16 Cycles at Dispatch 
(void) __db16cyc() 


The current thread is blocked at dispatch for 16 cycles. This intrinsic will not be reordered by the compiler. 


Table 7-133: Delay 16 Cycles At Dispatch 
Return/Argument Types Assembly Mapping 











none db16cyc 
__db8cyc: Delay 8 Cycles at Dispatch
(void) __db8cyc()


The current thread is blocked at dispatch for 8 cycles. This intrinsic will not be reordered by the compiler. 


Table 7-134: Delay 8 Cycles At Dispatch 
Return/Argument Types Assembly Mapping 
none db8cyc 














__dcbf: Data Cache Block Flush 

(void) __dcbf(pointer)

The cache block that contains the argument pointer is flushed and removed from the cache.
The base and index arguments for the assembly mapping are calculated from pointer.


Table 7-135: Data Cache Block Flush

Return/Argument Types    Assembly Mapping
pointer
void*                    dcbf base, index





__dcbst: Data Cache Block Store 


(void) __dcbst(pointer) 


The cache block that contains the argument pointer is written to main memory. This intrinsic will not be reordered 
by the compiler. 


The base and index arguments for the assembly mapping are calculated from pointer. 


Table 7-136: Data Cache Block Store 
Return/Argument Types    Assembly Mapping
pointer
void*                    dcbst base, index





__dcbt: Data Cache Block Touch 


(void) __dcbt(pointer) 


The processor receives a hint that the cache block which contains the argument pointer will soon be loaded. This 
intrinsic will not be reordered by the compiler. 


The base and index arguments for the assembly mapping are calculated from pointer. 


Table 7-137: Data Cache Block Touch 
Return/Argument Types    Assembly Mapping
pointer
void*                    dcbt base, index
__dcbt_TH1000: Start Streaming Data
(void) __dcbt_TH1000(EATRUNC, D, UG, ID)


A stream is started with an id of ID and an effective address of EATRUNC. The argument D gives the direction
of the stream: true for forwards and false for backwards. The argument UG specifies whether the stream is
unlimited in bounds. This intrinsic will not be reordered by the compiler.


The effective address for this instruction is calculated as:
(((unsigned long long) EATRUNC) & ~0x7F) | (((D & 1) << 6) | ((UG & 1) << 5) | (ID & 0xF))


The base and index arguments for the assembly mapping are calculated from the above effective address. 


Table 7-138: Start Streaming Data 











Return/Argument Types Assembly Mapping 
EATRUNC D UG ID 
void* bool bool int dcbt base, index, 8 
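
To make the bit layout of the effective-address formula for __dcbt_TH1000 concrete, here is a minimal C sketch that computes it. The function name th1000_ea is hypothetical and the address is passed as a plain 64-bit integer for illustration; the code only evaluates the formula and does not issue any dcbt instruction.

```c
#include <stdint.h>

/* Hypothetical helper that evaluates the __dcbt_TH1000 effective-address
 * formula: the low 7 bits of the address are replaced by control fields. */
static uint64_t th1000_ea(uint64_t eatrunc, int d, int ug, int id)
{
    return (eatrunc & ~UINT64_C(0x7F))      /* address, 128-byte granularity */
         | ((uint64_t)(d & 1) << 6)         /* D: stream direction */
         | ((uint64_t)(ug & 1) << 5)        /* UG: unlimited stream */
         | (uint64_t)(id & 0xF);            /* ID: stream identifier */
}
```

For instance, th1000_ea(0x12345, 1, 0, 3) yields 0x12343: the address is truncated to 0x12300, bit 6 carries D, and the low four bits carry the stream ID.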





__dcbt_TH1010: Stop Streaming Data 
(void) __dcbt_TH1010(GO, S, UNITCNT, T, U, ID)


The processor receives a hint that the stream identified by ID will no longer be needed. If GO is set, then the program
will soon load from all nascent data streams that have been completely described, and it will probably no longer load
from any other nascent data streams; all the rest of the arguments are ignored in this case. If S is 10 (binary), then the
stream associated with ID will stop and all other arguments except for ID are ignored. If S is 11 (binary), then all
stream IDs are stopped and all other arguments are ignored. UNITCNT specifies the number of units in a data stream.
T tells if the program's need for each block of the data stream is likely to be transient. U tells if the data stream is
unlimited, in which case the UNITCNT argument is ignored. This intrinsic will not be reordered by the compiler.


The effective address for this instruction is calculated as:
(((unsigned long long) GO & 1) << 31)
| ((S & 0x3) << 29)
| ((UNITCNT & 0x3FF) << 7)
| ((T & 1) << 6)
| ((U & 1) << 5)
| (ID & 0xF)


The base and index arguments for the assembly mapping are calculated from the above effective address. 


Table 7-139: Stop Streaming Data 











Return/Argument Types Assembly Mapping 
GO S UNITCNT T U ID 
bool int int bool bool int dcbt base, index, 10 
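
The field packing for __dcbt_TH1010 can be sketched the same way. The function name th1010_ea is hypothetical; the code simply evaluates the formula given above so that the position of each field is easy to check.

```c
#include <stdint.h>

/* Hypothetical helper that evaluates the __dcbt_TH1010 effective-address
 * formula by packing each argument into its bit field. */
static uint64_t th1010_ea(int go, int s, int unitcnt, int t, int u, int id)
{
    return ((uint64_t)(go & 1) << 31)           /* GO */
         | ((uint64_t)(s & 0x3) << 29)          /* S: stop control, 2 bits */
         | ((uint64_t)(unitcnt & 0x3FF) << 7)   /* UNITCNT: 10 bits */
         | ((uint64_t)(t & 1) << 6)             /* T: transient */
         | ((uint64_t)(u & 1) << 5)             /* U: unlimited */
         | (uint64_t)(id & 0xF);                /* ID: stream identifier */
}
```

Stopping stream 5, for example, corresponds to S = 2 (binary 10) and ID = 5, which packs to 0x40000005.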
__dcbtst: Data Cache Block Touch for Store


(void) __dcbtst(pointer)


The processor receives a hint that the cache block that contains the argument pointer will soon be stored. This 
intrinsic will not be reordered by the compiler. 


The base and index arguments for the assembly mapping are calculated from pointer. 


Table 7-140: Data Cache Block Touch For Store 
Return/Argument Types    Assembly Mapping
pointer
void*                    dcbtst base, index











__dcbz: Data Cache Block Set to Zero 


(void) __dcbz(pointer) 


The cache block that contains the argument pointer is zeroed out. If the address is already in cache, the cache 
block containing it is zeroed. If the address was not already in a cache block, a cache block for it is created with all 
zeros. This intrinsic will not be reordered by the compiler. 


The base and index arguments for the assembly mapping are calculated from pointer. 


Table 7-141: Data Cache Block Set to Zero 
Return/Argument Types    Assembly Mapping
pointer
void*                    dcbz base, index





__eieio: Enforce In-Order Execution of I/O 
(void) __eieio()
A memory barrier is created, which provides an ordering function for the storage accesses caused by Load, Store,
__dcbz(), __eciwx(), and __ecowx() instructions executed by the processor executing the __eieio()
instruction. The memory barrier and ordering function are described in section 1.7.1 of PowerPC Architecture Book,
Book II: PowerPC Virtual Environment Architecture, Version 2.02.


Table 7-142: Enforce In-Order Execution of I/O 





Return/Argument Types Assembly Mapping 








none eieio 





__fabs: Double Absolute Value
d = __fabs(a)
The absolute value of the argument a is returned in d with the sign bit set to zero.


Table 7-143: Double Absolute Value 











Return/Argument Types Assembly Mapping 
d a 
double double fabs d, a 
__fabsf: Float Absolute Value
d = __fabsf(a)


The absolute value of the argument a is returned in d with the sign bit set to zero.


Table 7-144: Float Absolute Value

Return/Argument Types    Assembly Mapping
d        a
float    float           fabs d, a











__fcfid: Convert Doubleword to Double 
d = __fcfid(a)


The doubleword in a is converted to a double-precision floating-point value and returned in d.


Table 7-145: Convert Doubleword to Double 











Return/Argument Types Assembly Mapping 
d a 
double long long fcfid d, a 





__fctid: Convert Double to Doubleword 
d = __fctid(a)
The double a is converted to a doubleword integer and returned in d. This function takes into account the current
rounding mode.


Table 7-146: Convert Double to Doubleword 











Return/Argument Types Assembly Mapping 
d a 
long long double fctid d, a 





__fctidz: Convert Double to Doubleword with Round Towards Zero
d = __fctidz(a)

The double a is converted to a doubleword integer and returned in d. This function always rounds towards zero. 

Table 7-147: Convert Double to Doubleword with Round Towards Zero

Return/Argument Types    Assembly Mapping
d           a
long long   double       fctidz d, a





__fctiw: Convert Double to Word
d = __fctiw(a)

The double a is converted to a word integer and returned in d. This function takes into account the current 
rounding mode. 

Table 7-148: Convert Double to Word

Return/Argument Types    Assembly Mapping
d     a
int   double             fctiw tmp, a
                         stfiwx tmp, r1, tempspace
                         lwzx d, r1, tempspace
__fctiwz: Convert Double to Word with Round Towards Zero
d = __fctiwz(a)

The double a is converted to a word integer and returned in d. This function always rounds towards zero. 

Table 7-149: Convert Double to Word with Round Towards Zero

Return/Argument Types    Assembly Mapping
d     a
int   double             fctiwz tmp, a
                         stfiwx tmp, r1, tempspace
                         lwzx d, r1, tempspace





__fmadd: Double Fused Multiply and Add
d = __fmadd(a, b, c)

The argument a is multiplied by the argument b, and the argument c is added to that product. The resulting value 
(a × b + c) is returned in d. 

Table 7-150: Double Fused Multiply and Add

Return/Argument Types                 Assembly Mapping
d        a        b        c
double   double   double   double     fmadd d, a, b, c











__fmadds: Float Fused Multiply and Add
d = __fmadds(a, b, c)

The argument a is multiplied by the argument b, and the argument c is added to that product. The resulting value 
(a × b + c) is returned in d. 

Table 7-151: Float Fused Multiply and Add

Return/Argument Types                 Assembly Mapping
d       a       b       c
float   float   float   float         fmadds d, a, b, c











__fmsub: Double Fused Multiply and Subtract
d = __fmsub(a, b, c)

The argument a is multiplied by the argument b, and the argument c is subtracted from that product. The resulting 
value (a × b - c) is returned in d. 

Table 7-152: Double Fused Multiply and Subtract

Return/Argument Types                 Assembly Mapping
d        a        b        c
double   double   double   double     fmsub d, a, b, c
__fmsubs: Float Fused Multiply and Subtract
d = __fmsubs(a, b, c)

The argument a is multiplied by the argument b, and the argument c is subtracted from that product. The resulting 
value (a × b - c) is returned in d. 

Table 7-153: Float Fused Multiply and Subtract

Return/Argument Types                 Assembly Mapping
d       a       b       c
float   float   float   float         fmsubs d, a, b, c











__fmul: Double Multiply
d = __fmul(a, b)

The doubles a and b are multiplied, and their product (a × b) is returned in d. 

Table 7-154: Double Multiply

Return/Argument Types        Assembly Mapping
d        a        b
double   double   double     fmul d, a, b











__fmuls: Float Multiply
d = __fmuls(a, b)

The floats a and b are multiplied, and their product (a × b) is returned in d. 

Table 7-155: Float Multiply

Return/Argument Types        Assembly Mapping
d       a       b
float   float   float        fmuls d, a, b











__fnabs: Double Negative
d = __fnabs(a)

The negative absolute value of the argument a is returned in d. The sign bit is set to 1. 

Table 7-156: Double Negative

Return/Argument Types    Assembly Mapping
d        a
double   double          fnabs d, a
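
The sign-bit behavior described for __fabs and __fnabs can be illustrated with a portable reference model. The sketch below is not the intrinsic itself: the names model_fabs and model_fnabs are hypothetical, and the code assumes that double uses the IEEE 754 64-bit format, in which the most significant bit of the 64-bit image is the sign bit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reference models; assume IEEE 754 64-bit doubles. */
#define MODEL_SIGN_BIT 0x8000000000000000ULL

/* Copy the bit image of a double into an integer and back. */
static uint64_t to_bits(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}

static double from_bits(uint64_t bits)
{
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Absolute value: the sign bit is cleared. */
double model_fabs(double a)
{
    return from_bits(to_bits(a) & ~MODEL_SIGN_BIT);
}

/* Negative absolute value: the sign bit is set. */
double model_fnabs(double a)
{
    return from_bits(to_bits(a) | MODEL_SIGN_BIT);
}
```

Because only the sign bit is touched, the model returns negative zero for model_fnabs(0.0), which matches the statement that the sign bit is set to 1.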











__fnabsf: Float Negative
d = __fnabsf(a)

The negative absolute value of the argument a is returned in d. The sign bit is set to 1. 

Table 7-157: Float Negative

Return/Argument Types    Assembly Mapping
d       a
float   float            fnabs d, a
__fnmadd: Double Fused Negative Multiply and Add
d = __fnmadd(a, b, c)

The arguments a and b are multiplied, and the argument c is added to their product. The sum is negated, and the 
resulting value -(a × b + c) is returned in d. 

Table 7-158: Double Fused Negative Multiply and Add

Return/Argument Types                 Assembly Mapping
d        a        b        c
double   double   double   double     fnmadd d, a, b, c





__fnmadds: Float Fused Negative Multiply and Add
d = __fnmadds(a, b, c)

The arguments a and b are multiplied, and the argument c is added to their product. The sum is negated, and the 
resulting value -(a × b + c) is returned in d. 

Table 7-159: Float Fused Negative Multiply and Add

Return/Argument Types                 Assembly Mapping
d       a       b       c
float   float   float   float         fnmadds d, a, b, c











__fnmsub: Double Fused Negative Multiply and Subtract
d = __fnmsub(a, b, c)

The arguments a and b are multiplied, and the argument c is subtracted from their product. The difference is 
negated, and the resulting value -(a × b - c) is returned in d. 

Table 7-160: Double Fused Negative Multiply and Subtract

Return/Argument Types                 Assembly Mapping
d        a        b        c
double   double   double   double     fnmsub d, a, b, c











__fnmsubs: Float Fused Negative Multiply and Subtract
d = __fnmsubs(a, b, c)

The arguments a and b are multiplied, and the argument c is subtracted from their product. The difference is 
negated, and the resulting value -(a × b - c) is returned in d. 

Table 7-161: Float Fused Negative Multiply and Subtract

Return/Argument Types                 Assembly Mapping
d       a       b       c
float   float   float   float         fnmsubs d, a, b, c
__fres: Float Reciprocal Estimate
d = __fres(a)

An estimate of the reciprocal of the argument a is returned in d. The estimate is correct to a precision of one part in 
256 of the reciprocal. 

Beyond this precision, the value is indeterminate; the results of executing this instruction may vary between 
implementations and between different executions on the same implementation. 

Table 7-162: Float Reciprocal Estimate

Return/Argument Types    Assembly Mapping
d       a
float   float            fres d, a











__frsp: Round to Single Precision
d = __frsp(a)

The argument a is rounded to single precision and returned in d. 

Table 7-163: Round to Single Precision

Return/Argument Types    Assembly Mapping
d       a
float   float            frsp d, a





__frsqrte: Double Reciprocal Square Root Estimate
d = __frsqrte(a)

An estimate of the reciprocal of the square root of the argument a is returned in d. 

The estimate is correct to a precision of one part in 32 of the reciprocal of the square root. Beyond this precision, 
the value is indeterminate; the results of executing this instruction may vary between implementations and between 
different executions on the same implementation. 

Table 7-164: Double Reciprocal Square Root Estimate

Return/Argument Types    Assembly Mapping
d       a
float   double           frsqrte d, a





__fsel: Floating-Point Select of Double
d = __fsel(a, b, c)

The argument b is returned in d if the argument a is less than or equal to 0.0; otherwise c is returned. 

Table 7-165: Floating-Point Select of Double

Return/Argument Types                 Assembly Mapping
d        a        b        c
double   double   double   double     fsel d, a, b, c
__fsels: Floating-Point Select of Float
d = __fsels(a, b, c)

The argument b is returned in d if the argument a is less than or equal to 0.0; otherwise c is returned. 

Table 7-166: Floating-Point Select of Float

Return/Argument Types                 Assembly Mapping
d       a       b       c
float   float   float   float         fsel d, a, b, c











__fsqrt: Double Square Root
d = __fsqrt(a)

The square root of the argument a is returned in d. 

Table 7-167: Double Square Root

Return/Argument Types    Assembly Mapping
d        a
double   double          fsqrt d, a





__fsqrts: Float Square Root
d = __fsqrts(a)

The square root of the argument a is returned in d. 

Table 7-168: Float Square Root

Return/Argument Types    Assembly Mapping
d       a
float   float            fsqrts d, a











__icbi: Instruction Cache Block Invalidate
(void) __icbi(pointer)

The instruction cache block that contains the argument pointer is invalidated, if such a block is in the cache. This 
intrinsic will not be reordered by the compiler. 

The base and index arguments for the assembly mapping are calculated from pointer. 

Table 7-169: Instruction Cache Block Invalidate

Return/Argument Types    Assembly Mapping
pointer
void*                    icbi base, index





__isync: Instruction Sync
(void) __isync()

The processor waits until all previous instructions have finished. The __isync() function ensures that all icbi 
instructions have been performed. 

Table 7-170: Instruction Sync

Return/Argument Types    Assembly Mapping
none                     isync
__ldarx: Load Doubleword with Reserved
d = __ldarx(pointer)

The reserved address of the processor is set to the value of pointer. A doubleword from the address in pointer 
is returned in d. 

The base and index arguments for the assembly mapping are calculated from pointer. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-171: Load Doubleword with Reserved

Return/Argument Types           Assembly Mapping
d                    pointer
unsigned long long   void*      ldarx d, base, index





__ldbrx: Load Reversed Doubleword
d = __ldbrx(pointer)

A doubleword from the address in pointer is loaded in reversed endian order into d and returned. 

The base and index arguments for the assembly mapping are calculated from pointer. 

Table 7-172: Load Reversed Doubleword

Return/Argument Types           Assembly Mapping
d                    pointer    64-bit ABI              32-bit ABI
unsigned long long   void*      ldbrx d, base, index    lwbrx d_lo, base, index
                                                        lwbrx d_hi, base, index+4





__lhbrx: Load Reversed Halfword
d = __lhbrx(pointer)

A halfword from the address in pointer is loaded in reversed endian order into d and returned. 

The base and index arguments for the assembly mapping are calculated from pointer. 

Table 7-173: Load Reversed Halfword

Return/Argument Types       Assembly Mapping
d                pointer
unsigned short   void*      lhbrx d, base, index





__lwarx: Load Word with Reserved
d = __lwarx(pointer)

The reserved address of the processor is set to the value of pointer. A word from the address in pointer is 
returned in d. 

The base and index arguments for the assembly mapping are calculated from pointer. 

Table 7-174: Load Word with Reserved

Return/Argument Types    Assembly Mapping
d          pointer
unsigned   void*         lwarx d, base, index
__lwbrx: Load Reversed Word
d = __lwbrx(pointer)

A word from the address in pointer is loaded in reversed endian order into d. 

The base and index arguments for the assembly mapping are calculated from pointer. 

Table 7-175: Load Reversed Word

Return/Argument Types    Assembly Mapping
d          pointer
unsigned   void*         lwbrx d, base, index
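
The effect of a byte-reversed load can be shown with a portable reference model. On the big-endian PPU, an ordinary load treats the byte at the lowest address as the most significant, so a reversed load treats it as the least significant. The sketch below is an illustration only; model_lwbrx and model_lhbrx are hypothetical names, not part of this specification.

```c
#include <stdint.h>

/* Hypothetical model of __lwbrx: the byte at the lowest address becomes
 * the least significant byte of the 32-bit result. */
uint32_t model_lwbrx(const void *pointer)
{
    const unsigned char *p = (const unsigned char *)pointer;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical model of __lhbrx: the same idea for a 16-bit halfword. */
uint16_t model_lhbrx(const void *pointer)
{
    const unsigned char *p = (const unsigned char *)pointer;
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

A typical use is reading little-endian data, for example a file header field or a device register image, without a separate byte-swap step.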





__lwsync: Light Weight Sync
(void) __lwsync()

A memory barrier is created, providing an ordering function for the storage accesses caused by prior Load, Store, 
and __dcbz() instructions that are executed by the processor executing __lwsync(). The memory barrier and 
ordering function are described in section 1.7.1 of PowerPC Architecture Book, Book II: PowerPC Virtual 
Environment Architecture, Version 2.02. 

Table 7-176: Light Weight Sync

Return/Argument Types    Assembly Mapping
none                     lwsync





__mffs: Move From Floating-Point Status and Control Register
d = __mffs()

The current Floating-Point Status and Control Register is returned in d. This intrinsic will not be reordered by the 
compiler. 

Table 7-177: Move From Floating-Point Status and Control Register

Return/Argument Types    Assembly Mapping
d
double                   mffs d











__mfspr: Move From Special Purpose Register
d = __mfspr(spr)

The contents of the special purpose register specified by spr are returned in d. This intrinsic will not be reordered 
by the compiler. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-178: Move From Special Purpose Register

Return/Argument Types                                 Assembly Mapping
d                    spr
unsigned long long   10-bit unsigned int (literal)    mfspr d, spr
__mftb: Move From Time Base
d = __mftb()

The time base register is returned in d. This intrinsic will not be reordered by the compiler. 

Table 7-179: Move From Time Base

Return/Argument Types    Assembly Mapping
d                        64-bit ABI    32-bit ABI
unsigned long long       mftb d        retry:
                                         mftbu d_hi
                                         mftb d_lo
                                         mftbu tmp
                                         cmp d_hi, tmp
                                         bne retry





__mtfsb0: Set Field of FPSCR
(void) __mtfsb0(bt)

Bit bt of Floating-Point Status and Control Register (FPSCR) is set to 0. This intrinsic will not be reordered by the 
compiler. It will also cause a barrier for floating-point operations. 

Table 7-180: Set Field of FPSCR

Return/Argument Types           Assembly Mapping
bt
5-bit unsigned int (literal)    mtfsb0 bt





__mtfsb1: Unset Field of FPSCR
(void) __mtfsb1(bt)

Bit bt of Floating-Point Status and Control Register is set to 1. This intrinsic will not be reordered by the compiler. It 
will also cause a barrier for floating-point operations. 

Table 7-181: Unset Field of FPSCR

Return/Argument Types           Assembly Mapping
bt
5-bit unsigned int (literal)    mtfsb1 bt











__mtfsf: Set Fields in FPSCR
(void) __mtfsf(flm, b)

The fields of Floating-Point Status and Control Register are set to b masked by the argument flm. This intrinsic will 
not be reordered by the compiler. It will also cause a barrier for floating-point operations. 

Table 7-182: Set Fields in FPSCR

Return/Argument Types                    Assembly Mapping
flm                            b
8-bit unsigned int (literal)   double    mtfsf flm, b
__mtfsfi: Set Field FPSCR From Other Field
(void) __mtfsfi(bf, u)

The u field of Floating-Point Status and Control Register is copied into the bf field of FPSCR. This intrinsic will not 
be reordered by the compiler. It will also cause a barrier for floating-point operations. 

Table 7-183: Set Field FPSCR From Other Field

Return/Argument Types                                           Assembly Mapping
bf                             u
3-bit unsigned int (literal)   4-bit unsigned int (literal)     mtfsfi bf, u











__mtspr: Move to Special Purpose Register
(void) __mtspr(spr, value)

The special purpose register specified by spr is set to the argument value. This intrinsic will not be reordered by 
the compiler. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-184: Move to Special Purpose Register

Return/Argument Types                                  Assembly Mapping
spr                             value
10-bit unsigned int (literal)   unsigned long long     mtspr spr, value











__mulhd: Multiply Doubleword, High Part
d = __mulhd(a, b)

The high part of the signed product of the doubleword arguments a and b is returned in d. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-185: Multiply Doubleword, High Part

Return/Argument Types                 Assembly Mapping
d           a           b
long long   long long   long long     mulhd d, a, b





__mulhdu: Multiply Double Unsigned Word, High Part
d = __mulhdu(a, b)

The high part of the unsigned product of the doubleword arguments a and b is returned in d. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-186: Multiply Double Unsigned Word, High Part

Return/Argument Types                                            Assembly Mapping
d                    a                    b
unsigned long long   unsigned long long   unsigned long long     mulhdu d, a, b
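
The "high part" of a doubleword product is the upper 64 bits of the full 128-bit result. As a rough illustration, the sketch below computes it portably from 32-bit partial products. It is a reference model under stated assumptions, not the intrinsic: model_mulhdu and model_mulhd are hypothetical names, and the signed variant assumes two's-complement conversion between int64_t and uint64_t.

```c
#include <stdint.h>

/* Hypothetical model of __mulhdu: upper 64 bits of the unsigned
 * 128-bit product, built from four 32x32-bit partial products. */
uint64_t model_mulhdu(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t hi_hi = a_hi * b_hi;

    /* Carry out of the low 64 bits of the product. */
    uint64_t carry = ((lo_lo >> 32) + (lo_hi & 0xFFFFFFFFu)
                                    + (hi_lo & 0xFFFFFFFFu)) >> 32;

    return hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + carry;
}

/* Hypothetical model of __mulhd: correct the unsigned high part for
 * negative operands (two's-complement representation assumed). */
int64_t model_mulhd(int64_t a, int64_t b)
{
    uint64_t high = model_mulhdu((uint64_t)a, (uint64_t)b);
    if (a < 0) high -= (uint64_t)b;
    if (b < 0) high -= (uint64_t)a;
    return (int64_t)high;
}
```

The high part is what a fixed-point multiply or a division-by-constant sequence typically needs, which is why it is exposed separately from the ordinary low-part multiply.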
__mulhw: Multiply Word, High Part
d = __mulhw(a, b)

The high part of the signed product of the word arguments a and b is returned in d. 

Table 7-187: Multiply Word, High Part

Return/Argument Types    Assembly Mapping
d     a     b
int   int   int          mulhw d, a, b











__mulhwu: Multiply Unsigned Word, High Part
d = __mulhwu(a, b)

The high part of the unsigned product of the word arguments a and b is returned in d. 

Table 7-188: Multiply Unsigned Word, High Part

Return/Argument Types                          Assembly Mapping
d              a              b
unsigned int   unsigned int   unsigned int     mulhwu d, a, b
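
For word operands, the high part is simply the upper 32 bits of the 64-bit product. The following sketch models that with ordinary 64-bit arithmetic. The function names are hypothetical and the code is a reference model rather than the intrinsic; the final conversion in model_mulhw assumes two's-complement representation.

```c
#include <stdint.h>

/* Hypothetical model of __mulhw: upper 32 bits of the signed product. */
int32_t model_mulhw(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;   /* exact, cannot overflow */
    /* Shift the bit image as unsigned to avoid an implementation-defined
     * right shift of a negative value. */
    return (int32_t)(uint32_t)((uint64_t)product >> 32);
}

/* Hypothetical model of __mulhwu: upper 32 bits of the unsigned product. */
uint32_t model_mulhwu(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)a * (uint64_t)b;
    return (uint32_t)(product >> 32);
}
```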





__nop: No Operation
(void) __nop()

The preferred nop instruction is generated. This intrinsic will not be reordered by the compiler. 

Table 7-189: No Operation

Return/Argument Types    Assembly Mapping
none                     nop





__rldcl: Rotate Left Doubleword then Clear Left
d = __rldcl(a, b, mb)

The value in the argument a is rotated leftwards by the number of bits specified by the argument b. A mask is 
generated having 1-bits from bit mb through bit 63, and 0-bits elsewhere. The rotated data ANDed with the 
generated mask is returned in d. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-190: Rotate Left Doubleword then Clear Left

Return/Argument Types                                                                           Assembly Mapping
d                    a                    b                    mb
unsigned long long   unsigned long long   unsigned long long   6-bit unsigned int (literal)     rldcl d, a, b, mb
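
The rotate-then-mask behavior can be modeled in portable C. In the sketch below, bits are numbered as in the description above, with bit 0 as the most significant bit, so "bit mb through bit 63" keeps the low-order 64 - mb bits. The name model_rldcl is hypothetical, and the model uses only the low six bits of b as the rotate count, which is how the rldcl instruction interprets its register operand.

```c
#include <stdint.h>

/* Rotate a 64-bit value left by n bits (n is reduced modulo 64). */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    n &= 63u;
    return n ? (x << n) | (x >> (64u - n)) : x;
}

/* Hypothetical model of __rldcl. mb must be in the range 0..63.
 * With bit 0 as the most significant bit, a mask of 1-bits from
 * bit mb through bit 63 is all ones shifted right by mb. */
uint64_t model_rldcl(uint64_t a, uint64_t b, unsigned mb)
{
    uint64_t mask = 0xFFFFFFFFFFFFFFFFULL >> mb;
    return rotl64(a, (unsigned)(b & 63u)) & mask;
}
```

For example, rotating left by 8 with mb set to 56 moves the most significant byte of a into the low byte of the result and clears everything else.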
__rldcr: Rotate Left Doubleword then Clear Right
d = __rldcr(a, b, me)

The value in the argument a is rotated leftwards by the number of bits specified by the argument b. A mask is 
generated having 1-bits from bit 0 through bit me and 0-bits elsewhere. The rotated data ANDed with the generated 
mask is returned in d. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-191: Rotate Left Doubleword then Clear Right

Return/Argument Types                                                                           Assembly Mapping
d                    a                    b                    me
unsigned long long   unsigned long long   unsigned long long   6-bit unsigned int (literal)     rldcr d, a, b, me











__rldic: Rotate Left Doubleword Immediate then Clear
d = __rldic(a, sh, mb)

The value in the argument a is rotated leftwards by the number of bits specified by the argument sh. A mask is 
generated having 1-bits from bit mb through bit 63-sh and 0-bits elsewhere. The rotated data ANDed with the 
generated mask is returned in d. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-192: Rotate Left Doubleword Immediate then Clear

Return/Argument Types                                                                                     Assembly Mapping
d                    a                    sh                             mb
unsigned long long   unsigned long long   6-bit unsigned int (literal)   6-bit unsigned int (literal)     rldic d, a, sh, mb











__rldicl: Rotate Left Doubleword Immediate then Clear Left
d = __rldicl(a, sh, mb)

The value in the argument a is rotated leftwards by the number of bits specified by the argument sh. A mask is 
generated having 1-bits from bit mb through bit 63 and 0-bits elsewhere. The rotated data ANDed with the generated 
mask is returned in d. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-193: Rotate Left Doubleword Immediate then Clear Left

Return/Argument Types                                                                                     Assembly Mapping
d                    a                    sh                             mb
unsigned long long   unsigned long long   6-bit unsigned int (literal)   6-bit unsigned int (literal)     rldicl d, a, sh, mb
__rldicr: Rotate Left Doubleword Immediate then Clear Right
d = __rldicr(a, sh, me)

The value in the argument a is rotated leftwards by the number of bits specified by the argument sh. A mask is 
generated having 1-bits from bit 0 through bit me and 0-bits elsewhere. The rotated data ANDed with the generated 
mask is returned in d. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-194: Rotate Left Doubleword Immediate then Clear Right

Return/Argument Types                                                                                     Assembly Mapping
d                    a                    sh                             me
unsigned long long   unsigned long long   6-bit unsigned int (literal)   6-bit unsigned int (literal)     rldicr d, a, sh, me











__rldimi: Rotate Left Doubleword Immediate then Mask Insert
d = __rldimi(a, b, sh, mb)

A mask is generated with 1-bits from bit mb through bit 63-sh, and 0-bits elsewhere. The value in a is ANDed with 
the complement of this mask, zeroing out just the bits inside the range mb through 63-sh. The argument b is rotated 
left by sh bits and the result is ANDed with the mask, zeroing out all bits outside the range mb through 63-sh. The 
two masked values are combined together with inclusive OR, and returned in d. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-195: Rotate Left Doubleword Immediate then Mask Insert

Return/Argument Types                                                                           Assembly Mapping
d                    a                    b                    sh                   mb
unsigned long long   unsigned long long   unsigned long long   6-bit unsigned int   6-bit unsigned int     mr d, a
                                                               (literal)            (literal)              rldimi d, b, sh, mb





__rlwimi: Rotate Left Word Immediate then Mask Insert
d = __rlwimi(a, b, sh, mb, me)

A mask is generated with 1-bits from bit mb through bit me, and 0-bits elsewhere. The value in a is ANDed with the 
complement of this mask, zeroing out just the bits inside the range mb through me. The argument b is rotated left by 
sh bits and the result is ANDed with the mask, zeroing out all bits outside the range mb through me. The two masked 
values are combined together with inclusive OR, and returned in d. 

Table 7-196: Rotate Left Word Immediate then Mask Insert

Return/Argument Types                                                                                     Assembly Mapping
d              a              b              sh                   mb                   me
unsigned int   unsigned int   unsigned int   5-bit unsigned int   5-bit unsigned int   5-bit unsigned int     mr d, a
                                             (literal)            (literal)            (literal)              rlwimi d, b, sh, mb, me
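
The mask-insert steps above can be sketched as a portable reference model. Bits are numbered with bit 0 as the most significant bit, as in the description. The names rotl32, mask32, and model_rlwimi are hypothetical; the wrap-around handling for mb greater than me follows the PowerPC mask definition and is not stated in the text above.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is reduced modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Mask with 1-bits from bit mb through bit me, where bit 0 is the most
 * significant bit. mb and me must be in 0..31. When mb > me the mask
 * wraps around the end of the word. */
static uint32_t mask32(unsigned mb, unsigned me)
{
    uint32_t from_mb    = 0xFFFFFFFFu >> mb;          /* bits mb..31 */
    uint32_t through_me = 0xFFFFFFFFu << (31u - me);  /* bits 0..me  */
    return (mb <= me) ? (from_mb & through_me) : (from_mb | through_me);
}

/* Hypothetical model of __rlwimi: insert the rotated b into a under the mask. */
uint32_t model_rlwimi(uint32_t a, uint32_t b,
                      unsigned sh, unsigned mb, unsigned me)
{
    uint32_t mask = mask32(mb, me);
    return (a & ~mask) | (rotl32(b, sh) & mask);
}
```

With sh = 8, mb = 16, and me = 23, the low byte of b replaces the second-lowest byte of a while all other bits of a are preserved, which is the usual bit-field insert idiom.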
__rlwinm: Rotate Left Word Immediate then AND With Mask
d = __rlwinm(a, sh, mb, me)

A mask is generated with 1-bits from bit mb through bit me, and 0-bits elsewhere. The value in a is rotated left by sh 
bits, then ANDed with this mask, and returned in d. 

Table 7-197: Rotate Left Word Immediate then AND With Mask

Return/Argument Types                                                                      Assembly Mapping
d              a              sh                   mb                   me
unsigned int   unsigned int   5-bit unsigned int   5-bit unsigned int   5-bit unsigned int     rlwinm d, a, sh, mb, me
                              (literal)            (literal)            (literal)
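
A common use of rotate-and-mask is extracting a bit field. The sketch below models the operation portably and uses it to extract one byte of a word. The names model_rlwinm and extract_byte are hypothetical, bit 0 is the most significant bit as in the description above, and only the non-wrapping case mb <= me is modeled.

```c
#include <stdint.h>

/* Hypothetical model of __rlwinm for 0 <= mb <= me <= 31 and 0 <= sh <= 31. */
uint32_t model_rlwinm(uint32_t a, unsigned sh, unsigned mb, unsigned me)
{
    /* Rotate left by sh; sh == 0 is handled separately so that no shift
     * by the full word width (undefined in C) is ever performed. */
    uint32_t rotated = sh ? (a << sh) | (a >> (32u - sh)) : a;

    /* 1-bits from bit mb through bit me, counting from the MSB. */
    uint32_t mask = (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31u - me));

    return rotated & mask;
}

/* Extract byte n of a word, where byte 0 is the most significant byte:
 * rotate the byte into the low 8 bits, then keep bits 24..31. */
uint32_t extract_byte(uint32_t a, unsigned n)
{
    return model_rlwinm(a, 8u * (n + 1u) % 32u, 24, 31);
}
```

A plain shift is the special case in which the mask discards exactly the bits that wrapped around, which is why one instruction can cover shifts, rotates, and field extraction.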











__rlwnm: Rotate Left Word then AND With Mask
d = __rlwnm(a, b, mb, me)

The argument a is rotated leftwards by the argument b. A mask is generated having 1-bits from bit mb through bit me, 
and 0-bits elsewhere. The rotated data ANDed with the generated mask is returned in d. 

Table 7-198: Rotate Left Word then AND With Mask

Return/Argument Types                                                                Assembly Mapping
d              a              b              mb                   me
unsigned int   unsigned int   unsigned int   5-bit unsigned int   5-bit unsigned int     rlwnm d, a, b, mb, me
                                             (literal)            (literal)





__setflm: Save and Set the FPSCR
d = __setflm(a)

The Floating-Point Status and Control Register is set to a, and the previous contents of that register are returned 
in d. This intrinsic will not be reordered by the compiler. It will also cause a barrier for floating-point operations. 

Table 7-199: Save and Set the FPSCR

Return/Argument Types    Assembly Mapping
d        a
double   double          mffs d
                         mtfsf 0xFF, a





__stdbrx: Store Reversed Doubleword
(void) __stdbrx(pointer, b)

The argument b is stored in reversed endian order into the doubleword located at the argument pointer. 

The base and index arguments for the assembly mapping are calculated from pointer. 

Table 7-200: Store Reversed Doubleword

Return/Argument Types           Assembly Mapping
pointer   b                     64-bit ABI               32-bit ABI
void*     unsigned long long    stdbrx b, base, index    stwbrx b_lo, base, index
                                                         stwbrx b_hi, base, index+4
__stdcx: Store Doubleword Conditional
d = __stdcx(pointer, b)

If the reserved address of the processor is the value in the argument pointer, b is stored into the doubleword at 
the argument pointer, and the value of 1 is returned in d. Otherwise, the store is not performed, and the value of 0 
is returned in d. 

The base and index arguments for the assembly mapping are calculated from pointer. 

The instruction stdcx. returns its value in cr0.eq, the equals field of conditional register 0. 

This intrinsic might not be supported when compiling for 32-bit ABIs in which a 64-bit doubleword is maintained in 
two separate registers. 

Table 7-201: Store Doubleword Conditional

Return/Argument Types                  Assembly Mapping
d      pointer   b
bool   void*     unsigned long long    stdcx. b, base, index; d = cr0.eq





__sthbrx: Store Reversed Halfword
(void) __sthbrx(pointer, b)

The argument b is stored in reversed endian order into the halfword located at the argument pointer. 

The base and index arguments for the assembly mapping are calculated from pointer. 

Table 7-202: Store Reversed Halfword

Return/Argument Types       Assembly Mapping
pointer   b
void*     unsigned short    sthbrx b, base, index





__stwbrx: Store Reversed Word
(void) __stwbrx(pointer, b)

The argument b is stored in reversed endian order into the word located at the argument pointer. 

The base and index arguments for the assembly mapping are calculated from pointer. 

Table 7-203: Store Reversed Word

Return/Argument Types    Assembly Mapping
pointer   b
void*     unsigned       stwbrx b, base, index
__stwcx: Store Word Conditional
d = __stwcx(pointer, b)

If the reserved address of the processor is the value in the argument pointer, b is stored into the word at the 
argument pointer, and the value of 1 is returned in d. Otherwise, the store is not performed, and the value of 0 is 
returned in d. 

The base and index arguments for the assembly mapping are calculated from pointer. 

The instruction stwcx. returns its value in cr0.eq, the equals field of conditional register 0. 

Table 7-204: Store Word Conditional

Return/Argument Types         Assembly Mapping
d      pointer   b
bool   void*     unsigned     stwcx. b, base, index; d = cr0.eq

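
A load-with-reservation and a conditional store are normally paired in a retry loop: load the current value, compute the new value, and try again if the conditional store returns 0. Because __lwarx and __stwcx are available only when compiling for the PPU, the sketch below uses C11 <stdatomic.h> as a portable analog, with atomic_compare_exchange_weak standing in for the conditional store. This is an assumption made for illustration, not the mechanism defined by this specification, and model_atomic_add is a hypothetical name.

```c
#include <stdatomic.h>

/* Portable analog of the reservation loop. On the PPU the same pattern
 * would be written with the intrinsics described above, for example:
 *
 *     do { old = __lwarx(pointer); } while (!__stwcx(pointer, old + delta));
 *
 * (shown as a comment only; it is not compiled here).
 */
unsigned model_atomic_add(_Atomic unsigned *pointer, unsigned delta)
{
    unsigned old = atomic_load(pointer);

    /* On failure, 'old' is refreshed with the current value and the
     * update is retried, just as a failed conditional store is retried. */
    while (!atomic_compare_exchange_weak(pointer, &old, old + delta)) {
        /* retry */
    }
    return old + delta;
}
```

Like the conditional store, the weak compare-exchange is allowed to fail even when no other processor has modified the location, so it must always be used inside a loop.
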
__sync: Sync
(void) __sync()

A memory barrier is created, providing an ordering function for all instructions executing on the same processor. 
The memory barrier and ordering function are described in section 1.7.1 of PowerPC Architecture Book, Book II: 
PowerPC Virtual Environment Architecture, Version 2.02. 

Table 7-205: Sync

Return/Argument Types    Assembly Mapping
none                     sync
8. SPU C and C++ Standard Libraries and Language Support 


This chapter describes differences between the implementations of the C and C++ standard libraries on the SPU 
and the corresponding ISO/IEC standards. It also identifies common language features that are specifically not 
supported on the SPU. 


8.1. Standard Libraries 


The C and C++ standard libraries that are required for the SPU are based on the Standard C Library described in 
ISO/IEC Standard 9899:1999 and the C++ Standard Library described in ISO/IEC Standard 14882:1998. However, 
neither library is required to be a fully compliant implementation of the respective ISO/IEC standard. 


The proposed differences from ISO/IEC compliant implementations are due to two reasons: 1) The SPU does not 
have the same system resources and operating system support that are available to most stand-alone processors; 
and 2) the SPU hardware doesn't fully support the IEEE floating-point standard. Because of the SPU's limited 
operating system support, library functions that require system calls, thread facilities, and file input/output (I/O) may 
not be supported. Because of differences in floating-point behavior, the results of single-precision floating-point 
functions will probably be less accurate than defined by the Standard, and floating-point exceptions will be less 
reliable. Nevertheless, the standard library functions that are provided should execute fast, in most cases. 


The minimum C and C++ library features that must be provided for the SPU are described in the following sections. 


8.1.1. C Standard Library 


This section describes the minimum requirements of a compliant C standard library implementation. 


Library Contents 


All of the entities required in the C standard library must be declared and defined within the library header files listed 
in Table 8-206. Differences between the contents of these header files and the header files that comprise the ISO 
Standard Library are identified in the table. For a detailed description of the particular entities, see the ISO/IEC C 
Standard listed in the “Related Documentation” section. 


Table 8-206: C Library Header Files

Header Name    Description

assert.h       Enforce assertions when functions execute. The assert macro reports assertion failures
               using the special debug printf (described below).

complex.h      Perform complex arithmetic.

ctype.h        Classify characters. The functions declared in this header use only the “C” locale.

errno.h        Test error codes reported by library functions.

fenv.h         Control IEEE style floating-point arithmetic. Macros for single- and double-precision
               exceptions are described in “9.2.2. Floating-Point Exceptions”.

float.h        Test floating-point type properties. These properties are specified in section “9.1. Properties
               of Floating-Point Data Type Representations”.

inttypes.h     Convert various integer types.

iso646.h       Program in ISO 646 variant character sets.

limits.h       Test integer type properties. The macro MB_LEN_MAX is defined as 1.

locale.h       Not available.

math.h         Compute common mathematical functions. The floating-point behavior of these functions will
               adhere to the specifications described in section “9.3. Floating-Point Operations”. Although
               not specified or required, corresponding vector versions of the math functions may be added
               to the library to take advantage of the many high-performance SIMD (single instruction,
               multiple data) instructions provided by the SPU hardware.

setjmp.h       Execute nonlocal goto statements.

signal.h       Not available.

stdarg.h       Access a varying number of arguments.

stdbool.h      Define a convenient Boolean type name and constants.

stddef.h       Define several useful types and macros. The type wchar_t is not defined.

stdint.h       Define various integer types with size constraints. SIG_ATOMIC_MAX and SIG_ATOMIC_MIN
               are not defined, nor are any of WCHAR_MAX, WCHAR_MIN, WINT_MAX, and WINT_MIN.

stdio.h        Not available, except for printf, which is provided for debugging. (See section “Debug
               printf()”.)

stdlib.h       Perform a variety of operations. The functions getenv, mblen, mbstowcs, mbtowc,
               system, wcstombs, and wctomb are not defined. The type wchar_t and the macro
               MB_CUR_MAX are also not defined.

string.h       Manipulate several kinds of strings. The function strxfrm uses only the “C” locale.

tgmath.h       Declare various type-generic math functions. Single-precision functions declared in this
               header adhere to the same specifications described for the corresponding functions that are
               declared in math.h.

time.h         Not available.

wchar.h        Not available.

wctype.h       Not available.





Debug printf() 

A printf() function will be provided for application debugging. The implementation of this function depends on the 
particular services provided by the underlying operating system. Although detailed specifications for this function are 
not mandated by this document, a full-featured implementation is recommended. Such an implementation would 
include all of the usual output format conversion specifiers required by the C standard. In addition, conversion 
specifiers of the type described in the AltiVec Technology Programming Interface Manual are recommended to 
handle vector output formatting. Output conversion specifiers take the following form: 


%[<flags>][<width>][<precision>][<size>]<conversion> 

where 

<flags> ::= <flag-char> | <flags><flag-char> 
<flag-char> ::= <std-flag-char> | <c-sep> 
<std-flag-char> ::= '-' | '+' | '0' | '#' | ' ' 
<c-sep> ::= ',' | ';' | ':' | '_' 
<width> ::= <decimal-integer> | '*' 
<precision> ::= '.' <width> 
<size> ::= 'hh' | 'h' | 'l' | 'll' | 'L' | <vector-size> 
<vector-size> ::= 'v' | 'vhh' | 'vh' | 'vl' | 'vll' | 'vL' | 'hhv' 
| 'hv' | 'lv' | 'llv' | 'Lv' 
<conversion> ::= <char-conv> | <str-conv> | <fp-conv> | <int-conv> 
| <byte-conv> | <misc-conv> 
<char-conv> ::= 'c' 
<str-conv> ::= 's' 
<fp-conv> ::= 'e' | 'E' | 'f' | 'F' | 'g' | 'G' 
<int-conv> ::= 'd' | 'i' | 'u' | 'o' | 'p' | 'x' | 'X' 
<byte-conv> ::= 'uc' | 'co' | 'cx' | 'cX' 
<misc-conv> ::= 'n' | '%' 


Extensions to the C standard output conversion specification are shown in bold for vector types. Vector types are 
formatted using the conversions shown in Table 8-207. String conversions (<str-conv>) and miscellaneous 
conversions (<misc-conv>) are not defined for vectors. The ‘p’ integer conversion (<int-conv>) is also not 
defined. The default separator (<c-sep>) is a space, except for character conversion (<char-conv>), which has 
no separator. 


Table 8-207: Vector Formats 

Vector Size Conversion Description 

v <char-conv> A vector is printed as a vector char, consisting of 16 one-byte elements. The ‘c’ 
conversion prints contiguous ASCII characters. 

v <int-conv>, <byte-conv> With the ‘uc’ conversion, a vector is printed as a vector unsigned char, 
consisting of 16 one-byte elements. Similarly, the ‘co’, ‘cx’, and ‘cX’ conversions 
print either a vector unsigned char or a qword, in octal format or in hexadecimal 
format. For all other integer conversions, a vector is printed in the respective 
octal (o), integer (d, i, u), or hexadecimal (x, X) format, either as a vector 
unsigned int or as a vector signed int, consisting of 4 four-byte elements. 

v <fp-conv> A vector is printed in a signed decimal fractional representation, either in 
standard decimal notation (f or F) or with a decimal power-of-ten exponent (e, 
E, g, G). The representation is printed as a vector float, containing 4 four-byte 
elements. 

vhh or hhv <int-conv> A vector is printed in the respective octal (o), integer (d, i, u), or hexadecimal (x, 
X) format, either as a vector unsigned char or as a vector signed char, 
consisting of 16 one-byte elements. 

vh or hv <int-conv> A vector is printed in the respective octal (o), integer (d, i, u), or hexadecimal (x, 
X) format, either as a vector unsigned short or as a vector signed short, 
consisting of 8 two-byte elements. 

vl or lv <int-conv> A vector is printed in the respective octal (o), integer (d, i, u), or hexadecimal (x, 
X) format, as a vector unsigned int or as a vector signed int, consisting of 4 
four-byte elements. 

vll or llv <int-conv> A vector is printed in the respective octal (o), integer (d, i, u), or hexadecimal (x, 
X) format, as a vector unsigned long long or as a vector signed long long, 
consisting of 2 eight-byte elements. 

vL or Lv <fp-conv> A vector is printed in a signed decimal fractional representation, either in 
standard decimal notation (f or F) or with a decimal power-of-ten exponent (e, 
E, g, G). The representation is printed as a vector double, consisting of 2 eight- 
byte elements. 
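
As an illustration of the separator rule for vector conversions, the following host-side sketch formats four int elements the way a vector integer conversion with a given separator might appear. It does not use the SPU printf() itself; the function name format_vec4_int and the use of a plain int array in place of a vector type are assumptions made for this example only. 

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the four elements of v into buf, separated by
 * sep. Passing ' ' models the default separator described above; passing
 * another character models an explicit <c-sep> flag.
 * Returns the value returned by snprintf(). */
static int format_vec4_int(char *buf, size_t size, const int v[4], char sep)
{
    return snprintf(buf, size, "%d%c%d%c%d%c%d",
                    v[0], sep, v[1], sep, v[2], sep, v[3]);
}
```

With the elements 1, 2, 3, and 4, a space separator yields the text 1 2 3 4, and a comma separator yields 1,2,3,4. 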





Malloc Heap 

The malloc heap is defined to begin at _end and to extend to the end of the stack. The memory heap may be 
enlarged by a heap-extending function. This function would negatively adjust the Available Stack Size element of 
the current Stack Pointer Information register and all Available Stack Sizes residing in the saved SP registers found 
in the sequence of Back Chain quadwords. 


Whenever the malloc heap is enlarged, code should verify that the enlarged malloc heap does not extend into 
the currently used stack. If it does, the operation should fail. 
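
The check described above can be sketched as follows. This is a minimal model, not a mandated implementation: the type sp_info_t, the function extend_heap, and the idea of tracking the heap end as a plain integer are assumptions for illustration. The sketch also omits the adjustment of the saved SP registers in the Back Chain quadwords. 

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the Stack Pointer Information register. */
typedef struct {
    uint32_t stack_ptr;         /* current stack pointer */
    uint32_t avail_stack_size;  /* stack bytes still available */
} sp_info_t;

/* Hypothetical heap-extending function. The request fails, leaving all
 * state unchanged, if the enlarged heap would reach into the used stack. */
static bool extend_heap(uint32_t *heap_end, uint32_t bytes, sp_info_t *sp)
{
    if (bytes > sp->avail_stack_size)
        return false;                 /* would collide with the used stack */
    *heap_end += bytes;               /* enlarge the heap */
    sp->avail_stack_size -= bytes;    /* negatively adjust the available size */
    return true;
}
```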


Implementations of setjmp/longjmp are also affected by the use of heap-extending functions. When restoring the 
Stack Pointer Information register as a result of invoking the longjmp function, the function must detect any change 
to the Available Stack Size between setjmp and longjmp, and it must correct the saved Stack Pointer Information 
register. For example: 


SP.avail_stack_size = SP_set.stack_ptr - SP.stack_ptr + 
SP.avail_stack_size; 


where SP is the current Stack Pointer Information register, and SP_set is the Stack Pointer Information register 
saved at the last setjmp call. 
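
The correction can be modeled on a host as shown below. The type sp_info_t and the function sp_after_longjmp are hypothetical names for this sketch; an actual longjmp implementation operates on the register directly. 

```c
#include <stdint.h>

/* Hypothetical view of the Stack Pointer Information register. */
typedef struct {
    uint32_t stack_ptr;         /* stack pointer */
    uint32_t avail_stack_size;  /* Available Stack Size */
} sp_info_t;

/* sp:     current Stack Pointer Information register at the longjmp call.
 * sp_set: value saved at the last setjmp call.
 * Returns the value to restore: the saved stack pointer, with an Available
 * Stack Size that reflects any heap extension made since setjmp. */
static sp_info_t sp_after_longjmp(sp_info_t sp, sp_info_t sp_set)
{
    sp_info_t restored = sp_set;
    restored.avail_stack_size =
        sp_set.stack_ptr - sp.stack_ptr + sp.avail_stack_size;
    return restored;
}
```

For example, if setjmp saved a stack pointer of 0x3F000 with 0x1000 bytes available, and the heap was then extended by 0x200 bytes while the stack pointer was at 0x3E800 (leaving 0x600 bytes available), the restored Available Stack Size is 0xE00, which is the saved 0x1000 less the 0x200 bytes given to the heap. 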
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This section describes the minimum contents of the C++ standard library. 


As with the C library, the C++ library header files declare or define the contents of the C++ library. Table 8-208 lists 
the header files that comprise the core of the C++ standard library. Differences between the contents of the C++ 
header files and the header files that comprise the ISO Standard Library are noted in this table. 


Table 8-208: C++ Library Header Files 





Header Name Description 

algorithm Define numerous templates that implement useful algorithms. 
bitset Define a template class that administers sets of bits. 
complex Define a template class that supports complex arithmetic. 
deque Define a template class that implements a deque container. 
exception Not available. 
fstream Not available. 
functional Define several templates that help construct predicates for the templates defined in algorithm 
and numeric. 
iomanip Not available. 
ios Not available. 
iosfwd Not available. 
iostream Not available. 
istream Not available. 
iterator Define several templates that help define and manipulate iterators. 
limits Test numeric type properties. 
list Define a template class that implements a doubly linked list container. 
locale Not available. 
map Define template classes that implement associative containers that map keys to values. 
memory Define several templates that allocate and free storage for various container classes. 
new Declare several functions that allocate and free storage. 
numeric Define several templates that implement useful numeric functions. 
ostream Not available. 
queue Define a template class that implements a queue container. 
set Define template classes that implement associative containers. 
slist Define a template class that implements a singly linked list container. 
sstream Not available. 
stack Define a template class that implements a stack container. 
stdexcept Not available. 
streambuf Not available. 
string Define a template class that implements a string container. 
strstream Not available. 
typeinfo Not available. 
utility Define several templates of general utility. 
valarray Define several classes and template classes that support value-oriented arrays. 
vector Define a template class that implements a vector container. 





The C++ standard library contains new-style C++ header files that correspond to 12 traditional C header files. Both 
the new-style and the traditional-style header files are included in the library. These header files are listed in Table 
8-209. 
Table 8-209: New and Traditional C++ Library Header Files 

New-Style Header Name Traditional Header Name Description 

cassert assert.h Enforce assertions when functions execute.¹ 
cctype ctype.h Classify characters.¹ 
cerrno errno.h Test error codes reported by library functions.¹ 
cfloat float.h Test floating-point type properties. 
ciso646 iso646.h Program in ISO 646 variant character sets. 
climits limits.h Test integer type properties.¹ 
clocale locale.h Not available. 
cmath math.h Compute common mathematical functions.¹ 
csetjmp setjmp.h Execute nonlocal goto statements. 
csignal signal.h Not available. 
cstdarg stdarg.h Access a varying number of arguments. 
cstddef stddef.h Define several useful types and macros.¹ 
cstdio stdio.h Not available. 
cstdlib stdlib.h Perform a variety of operations.¹ 
cstring string.h Manipulate several kinds of strings.¹ 
ctime time.h Not available. 
cwchar wchar.h Not available. 
cwctype wctype.h Not available. 

¹ See Table 8-206: C Library Header Files, for specific implementation limitations. 


Non-Supported Language Features 


C and C++ implementations should comply with the language features prescribed in the respective ISO/IEC 
standards, as much as possible. However, certain features are specifically not supported because of SPU 
architecture limitations. Below is a list of non-supported features: 

• C++ exception handling 
9. Floating-Point Arithmetic on the SPU 


Annex F of the C99 language standard (ISO/IEC 9899) specifies support for the IEC 60559 floating-point standard. 
This chapter describes differences from Annex F and ISO/IEC Standard 60559 that apply to SPU compilers and 
libraries. 


Floating-point behavior is essentially dictated by the SPU hardware. For single precision, the hardware provides an 
extended single-precision number range. Denorm arguments are treated as 0, and NaN and Infinity are not supported. 
The only rounding mode that is supported is truncation (round towards 0), and exceptions apply only to certain 
extended-range floating-point instructions. For double precision, the hardware provides the standard IEEE number 
range, but again, denorm arguments are treated as 0. IEEE exceptions are detected and accumulated in the FPSCR 
register, and the IEEE rules for propagation of NaNs are not implemented in the architecture. (For details, see the 
Synergistic Processor Unit Instruction Set Architecture.) These and other IEEE differences affect almost every 
aspect of floating-point computation, including data-type properties, rounding modes, exception status, error reporting, 
and expression evaluation. The particular effects of these differences on the compiler and libraries are described in the 
following sections. 


9.1. Properties of Floating-Point Data Type Representations 


The properties of floating-point data type representations are declared as macros in float.h. Table 9-210 lists these 
macros and the corresponding values that are applicable for the SPU. 


Table 9-210: Values for Floating-Point Type Properties 





Macro Value 

FLT_DIG 6 

FLT_EPSILON 0x1p-23f (1.19209290E-07f) 
FLT_MANT_DIG 24 

FLT_MAX_10_EXP 38 

FLT_MAX_EXP 129 

FLT_MIN_10_EXP -37 

FLT_MIN_EXP -125 

FLT_MAX 0x1.FFFFFEp128f (6.80564694E+38f) 
FLT_MIN 0x1p-126f (1.17549436E-38f) 
FLT_ROUNDS Initialized to 16 (to nearest for both elements) 
FLT_EVAL_METHOD 0 (no promotions occur) 

FLT_RADIX 2 

DBL_DIG 15 

DBL_EPSILON 0x1p-52 (2.2204460492503131E-016) 
DBL_MANT_DIG 53 

DBL_MAX_10_EXP 308 

DBL_MAX_EXP 1024 

DBL_MIN_10_EXP -307 

DBL_MIN_EXP -1021 

DBL_MAX 0x1.FFFFFFFFFFFFFp1023 (1.7976931348623157E+308) 
DBL_MIN 0x1p-1022 (2.2250738585072014E-308) 
DECIMAL_DIG 17 
9.2. Floating-Point Environment 


The macros defined within fenv.h control the directed-rounding control mode and floating-point exception status 
flags for floating-point operations. 


9.2.1. Rounding Modes 


Whereas the C language specification requires that all floating-point data types use the same rounding modes, the 
SPU hardware supports different rounding modes for single- and double-precision arithmetic. On the SPU, the 
rounding mode for single precision is round-towards-zero, and the default rounding mode for double precision is 
round-to-nearest. 


According to the C99 standard, the rounding mode for floating-point addition is characterized by the implementation- 
defined value of FLT_ROUNDS. On the SPU, this macro is only used for double precision. Single-precision rounding 
mode is always truncation. (See Table 9-210.) 


FLT_ROUNDS returns a 5-bit value that represents the rounding modes for both double-precision elements. The 
highest bit is always 1. The next two bits are the rounding mode for element 0, and the two lowest bits are the 
rounding mode for element 1. Table 9-211 lists the rounding mode represented by the two bits for each element. 


Table 9-211: Rounding Mode for Two Bits of FLT_ROUNDS 





Last Two Bits Rounding Mode 

00 Round to nearest even 

01 Round toward zero (truncate) 
10 Round toward +infinity 

11 Round towards -infinity 
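
The 5-bit layout can be unpacked as in the following sketch. The enumeration and the function decode_flt_rounds are hypothetical names chosen for this example; only the bit positions and the two-bit encodings come from the description and table above. 

```c
/* Two-bit rounding-mode encodings from Table 9-211. */
typedef enum {
    RM_NEAREST_EVEN = 0,  /* 00 */
    RM_TOWARD_ZERO  = 1,  /* 01 */
    RM_UPWARD       = 2,  /* 10 */
    RM_DOWNWARD     = 3   /* 11 */
} round_mode_t;

/* Hypothetical decoder for a FLT_ROUNDS value.
 * Returns 0 on success, or -1 if the value is not a 5-bit quantity with
 * its highest bit set. */
static int decode_flt_rounds(int value, round_mode_t *elem0, round_mode_t *elem1)
{
    if ((value & ~0x1F) != 0 || (value & 0x10) == 0)
        return -1;
    *elem0 = (round_mode_t)((value >> 2) & 0x3);  /* bits 3..2: element 0 */
    *elem1 = (round_mode_t)(value & 0x3);         /* bits 1..0: element 1 */
    return 0;
}
```

Decoding the initial value 16 (binary 10000) gives round to nearest even for both elements, which matches the FLT_ROUNDS entry in Table 9-210. 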


Because the SPU hardware only supports rounding towards zero for single precision, some single-precision math 
functions will necessarily deviate from the C99 standard. The standard library math functions and macros that deviate 
are described later, in section “9.3.2. Overall Behavior of C Operators and Standard Library Math Functions”. 


Table 9-212 lists the macros that can be used to set the double precision rounding modes for element 0 and element 
1. The macros for element 0 and element 1 may be used together with a bitwise OR to set the rounding mode for 
both elements, or the macros can be used separately to set the rounding mode for only that element. 


Table 9-212: Macros for Double Precision Rounding Modes 





Macro Comment 

FE_TONEAREST Set element 0 to round to nearest even 
FE_TOWARDZERO Set element 0 to round towards zero 
FE_UPWARD Set element 0 to round towards +infinity 
FE_DOWNWARD Set element 0 to round towards -infinity 
FE_TONEAREST_1 Set element 1 to round to nearest even 
FE_TOWARDZERO_1 Set element 1 to round towards zero 
FE_UPWARD_1 Set element 1 to round towards +infinity 
FE_DOWNWARD_1 Set element 1 to round towards -infinity 


9.2.2. Floating-Point Exceptions 


Table 9-213 and Table 9-214 list the macros for floating-point exceptions that will be defined in fenv.h. Because of 
the restricted behavior of the SPU floating-point hardware, single-precision library functions can have an undefined 
effect on these exception flags. Moreover, hardware traps will not result from any raised exception. 
Table 9-213: Macros for Single Precision Floating-Point Exceptions 

Macro Comment 

FE_OVERFLOW_SNGL Overflow exception for element 0 
FE_UNDERFLOW_SNGL Underflow exception for element 0 
FE_DIFF_SNGL Different from IEEE exception for element 0 
FE_DIVBYZERO_SNGL Divide by zero exception for element 0 
FE_OVERFLOW_SNGL_1 Overflow exception for element 1 
FE_UNDERFLOW_SNGL_1 Underflow exception for element 1 
FE_DIFF_SNGL_1 Different from IEEE exception for element 1 
FE_DIVBYZERO_SNGL_1 Divide by zero exception for element 1 
FE_OVERFLOW_SNGL_2 Overflow exception for element 2 
FE_UNDERFLOW_SNGL_2 Underflow exception for element 2 
FE_DIFF_SNGL_2 Different from IEEE exception for element 2 
FE_DIVBYZERO_SNGL_2 Divide by zero exception for element 2 
FE_OVERFLOW_SNGL_3 Overflow exception for element 3 
FE_UNDERFLOW_SNGL_3 Underflow exception for element 3 
FE_DIFF_SNGL_3 Different from IEEE exception for element 3 
FE_DIVBYZERO_SNGL_3 Divide by zero exception for element 3 
FE_ALL_EXCEPT_SNGL Bitwise OR of all macros for element 0 
FE_ALL_EXCEPT_SNGL_1 Bitwise OR of all macros for element 1 
FE_ALL_EXCEPT_SNGL_2 Bitwise OR of all macros for element 2 
FE_ALL_EXCEPT_SNGL_3 Bitwise OR of all macros for element 3 





Table 9-214: Macros for Double Precision Floating-Point Exceptions 

Macro Comment 

FE_OVERFLOW_DBL Overflow exception for element 0 
FE_UNDERFLOW_DBL Underflow exception for element 0 
FE_INEXACT_DBL ISO/IEC inexact for element 0 
FE_INVALID_DBL ISO/IEC invalid for element 0 
FE_NC_NAN_DBL Possibly non-compliant NaN for element 0 
FE_NC_DENORM_DBL Possibly non-compliant denormal for element 0 
FE_OVERFLOW_DBL_1 Overflow exception for element 1 
FE_UNDERFLOW_DBL_1 Underflow exception for element 1 
FE_INEXACT_DBL_1 ISO/IEC inexact for element 1 
FE_INVALID_DBL_1 ISO/IEC invalid for element 1 
FE_NC_NAN_DBL_1 Possibly non-compliant NaN for element 1 
FE_NC_DENORM_DBL_1 Possibly non-compliant denormal for element 1 
FE_ALL_EXCEPT_DBL Bitwise OR of all macros for element 0 
FE_ALL_EXCEPT_DBL_1 Bitwise OR of all macros for element 1 
FE_ALL_EXCEPT Bitwise OR of all macros from this table 





The floating-point environment variables defined in the C99 specification only apply to double-precision. 





The pragma FENV_ACCESS will be used to inform the compiler whether the program intends to control and test 
floating-point status. If the pragma is on, the compiler will take appropriate action to ensure that code transformations 
preserve the behavior specified in this document. 
9.2.3. Other Floating-Point Constants in math.h 


Several additional floating-point constants are defined in math.h. These constants are used by functions to report 
various domain and range errors. Many have a non-standard definition for the SPU. A description of these particular 
constants is shown in Table 9-215. 


Table 9-215: Floating-Point Constants 

Macro Description 

HUGE_VAL Infinity 

HUGE_VALF FLT_MAX 

HUGE_VALL Infinity 

INFINITY, NAN Double precision adheres to the IEEE definition. These macros are not used for single- 
precision operations. 

FP_INFINITE, FP_NAN, FP_NORMAL, FP_SUBNORMAL, FP_ZERO For single precision, the fpclassify() function will only 
return FP_NORMAL and FP_ZERO classes; FP_NAN, FP_INFINITE, and FP_SUBNORMAL are never generated. 

FP_FAST_FMA, FP_FAST_FMAF, FP_FAST_FMAL These are defined to indicate that the fma function executes more quickly 
than a multiply and an add of float and double operands. 

FP_ILOGB0, FP_ILOGBNAN FP_ILOGB0 is the value returned by ilogb(x) and ilogbf(x) if x is zero or a 
denorm number. Its value is INT_MIN. FP_ILOGBNAN is the value returned by ilogb(x) if x is a NaN. This does 
not apply to the single-precision case of ilogbf. Its value is INT_MAX. 

MATH_ERRNO, MATH_ERREXCEPT These will expand to the integer constants 1 and 2, respectively. 

math_errhandling Expands to an expression that has type int and the value MATH_ERRNO, 
MATH_ERREXCEPT, or the bitwise OR of both. The value of math_errhandling is 
constant for the duration of a program. 




















9.3. Floating-Point Operations 


This section specifies floating-point data conversions, and it describes the overall behavior of C operators and 
standard library functions. It also describes several special cases where floating-point results might vary from the 
IEEE standard. Lastly, the section describes the specific behavior of several specific math functions. 


9.3.1. Floating-Point Conversions 


This section provides specifications for the four types of floating-point data conversions: 1) conversions from integers 
to floating-point; 2) conversions from floating-point to integer; 3) conversion between floating-point precisions; and, 
4) conversions between floating-point and string. 


Integer to Floating-Point Conversions 
Conversions from integers to floats will adhere to the following rules: 
• A single-precision conversion from integer to float produces a result within the extended single-precision 
floating-point range. See Table 9-210 for details about this range. 
• A single-precision conversion from integer to float rounds towards zero. 
• A double-precision conversion from integer to float produces a result within the C99 standard 
double-precision floating-point range. 
• A double-precision conversion from integer to float rounds according to the rounding mode indicated by the 
value of FLT_ROUNDS. 
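
To make the round-towards-zero rule concrete, the following host-side sketch emulates a single-precision conversion that truncates instead of rounding to nearest. The function name int_to_float_trunc is an assumption for this example, and the code only models the rounding direction; it is not how an SPU compiler performs the conversion. 

```c
#include <stdint.h>

/* Hypothetical emulation of an integer-to-float conversion that rounds
 * towards zero. A float holds 24 significant bits, so any lower bits of
 * the magnitude are cleared before the (then exact) conversion. */
static float int_to_float_trunc(int32_t v)
{
    /* Work on the magnitude so that both signs truncate towards zero. */
    uint32_t mag = (v < 0) ? (uint32_t)0 - (uint32_t)v : (uint32_t)v;
    int bits = 0;
    uint32_t t;

    for (t = mag; t != 0; t >>= 1)
        bits++;                              /* count significant bits */

    if (bits > 24) {
        uint32_t drop = (uint32_t)(bits - 24);
        mag &= ~((UINT32_C(1) << drop) - 1u);  /* discard bits that do not fit */
    }

    return (v < 0) ? -(float)mag : (float)mag;
}
```

For the value 16777219 (2^24 + 3), this sketch yields 16777218.0f, whereas a round-to-nearest conversion yields 16777220.0f. 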
Floating-Point to Integer Conversions 


Conversions from floats to integers will have the following behavior: 

• When converting from a float to an integer, exceptions are raised for overflow, underflow, and IEEE 
non-compliant result. 

• Overflow and underflow exceptions are raised when converting from a double to an integer. If a 
double-precision value is infinite or NaN or if the integral part of the floating value exceeds the range of the 
integer type, an “invalid” floating-point exception is raised, and the resulting value is unspecified. An “inexact” 
floating-point exception is raised by the hardware when a conversion involves an integral floating-point value 
that is outside the range of the integer data type. 


Conversions between Floating-Point Precision 


To achieve maximum performance, compilers only perform conversion from float to double and from double to 
float within the IEEE standard range. These conversions will comply with the IEEE standard, except for denormal 
inputs, which are forced to zero. Conversion of numbers outside of the IEEE standard range is unspecified. 
Conversions with NaNs, infinities, or denormal results are also unspecified. 


Conversions between Floating-Point and Strings 


Conversions between floating-point and string values will adhere to both the extended single-precision floating-point 
range and the IEEE standard double-precision floating-point range. 


9.3.2. Overall Behavior of C Operators and Standard Library Math Functions 


Library functions and compilers will obey the same general rules with respect to rounding and overflow. These rules 
differ, however, depending on whether the code is single precision or double precision. 


Single-Precision Code 


For single precision, the C operators (+, -, *, and /) and the standard library math functions will have the following 
behavior: 

• If the operation produces a value with a magnitude greater than the largest positive representable 
extended-precision number, the result will be FLT_MAX with appropriate sign, and the overflow flag will be raised. 

• For all operators and standard functions, except the negate operator and the fabsf() and copysignf() 
functions, an argument with a denormal value will be treated as +0.0. 

• Except for the negate operator and the fabsf() and copysignf() functions, operators and standard 
functions will never return a denormal value or -0.0. 

• The negate operator and the fabsf() and copysignf() functions must be implemented such that only the 
sign bit is changed. 

• Expressions will be evaluated using the round-towards-zero mode. Implementations that depend on other 
rounding directions for algorithm correctness will produce incorrect results and therefore cannot be used. 

• The overflow flag will be set when FLT_MAX is returned instead of a value whose magnitude is too large. 
Because infinity is undefined for single precision, FLT_MAX will be used to signal infinity in situations where 
infinity would otherwise be generated on an IEEE754-compliant system. This modification will enable 
common trig identities to work. 

• NaN is not supported and does not need to be copied from any input parameter. 

• By default, compilers may perform optimizations for single-precision floating-point arithmetic that assume 1) 
that NaNs are never given as arguments; and, 2) that +Inf will never be generated as a result. 

• Compilers can assume that floating-point operations will not generate user-visible traps, such as division by 
zero, overflow, and underflow. 

• Constant expressions that are evaluated at compile time will produce the same result as they would if they 
were evaluated at runtime. For example, 

float x = 6.0e38f * 8.1e30f; 

will be evaluated as FLT_MAX. 

• Compilers may use single-precision contracted operations, such as Floating Reciprocal Absolute Square 
Root Estimate (frsqest) or Floating Multiply and Add (fma), unless explicitly prohibited by the FP_CONTRACT 
pragma or a no-fast-float compiler option. When contracted operations are used, errno does not need to be 
set. 
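
The overflow rule for single precision can be modeled on a host by carrying values in double and clamping to the extended-range FLT_MAX from Table 9-210. This is only a sketch of the saturation behavior: the names SPU_FLT_MAX and spu_saturate are hypothetical, and the model ignores rounding and denormal handling. 

```c
/* Extended single-precision maximum from Table 9-210, held in a double
 * because it exceeds the IEEE single-precision range of a typical host. */
#define SPU_FLT_MAX 0x1.FFFFFEp128

/* Hypothetical model of single-precision saturation: a result whose
 * magnitude is too large becomes FLT_MAX with the appropriate sign, and
 * *overflow is set to 1. Otherwise the value is returned unchanged and
 * *overflow is set to 0. */
static double spu_saturate(double exact, int *overflow)
{
    *overflow = 0;
    if (exact > SPU_FLT_MAX) {
        *overflow = 1;
        return SPU_FLT_MAX;
    }
    if (exact < -SPU_FLT_MAX) {
        *overflow = 1;
        return -SPU_FLT_MAX;
    }
    return exact;
}
```

Applied to the product 6.0e38 * 8.1e30 from the example above, the model returns SPU_FLT_MAX and reports an overflow, while 6.0e38 alone passes through unchanged because it lies within the extended range. 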


Double-Precision Code 
For double-precision floating-point, the C operators and standard library math functions will be compliant with the 
IEEE standard, with the following exceptions: 

• When a NaN is produced as a result of an operation, it will always be a QNaN. 

• Except for the negate operator and the fabs() and copysign() functions, denormal values will only be 
supported as results. A denormal operand is treated as 0 with the same sign as the denormal operand. 

• The default rounding mode for double precision is rounding to nearest. 

• Compilers may use double-precision contracted operations, such as Double Floating Multiply and Add (dfma), 
unless explicitly prohibited by the FP_CONTRACT pragma or a no-fast-double compiler option. When 
contracted operations are used, errno does not need to be set. 


9.3.3. Floating-Point Expression Special Cases 
The C99 standard describes several standard expression transformations that might fail to produce the required 
effect on the SPU: 

• x/2 -> x*0.5 
Valid for this particular value because the value is an exact power of 2, but it is invalid in general (for example, 
x/10 != x*0.1) because the floating-point constant is not exactly representable in any finite base-2 
floating-point system. 

• x*1 -> x and x/1 -> x 
Invalid when: 1) x is a SNaN or a non-default QNaN (double precision only); 2) x is a denormal number; or, 3) 
x is -0.0 (single precision only). 

• x/x -> 1.0 
Invalid for single precision when x is zero or a denormal, and invalid for double precision when x is zero, a 
denormal, Inf, or NaN. 

• x-y -> -(y-x) 
Invalid for zero results, which might have different signs. For double precision with rounding to +/- infinity, 
non-zero results might also differ by 1 ULP. 

• x-x -> 0.0 
Always valid for single precision, but the equivalence is invalid for double precision when x is either NaN or 
Inf. It is also invalid for double precision for round to -infinity, in which case the result will be -0.0. 

• 0*x -> 0.0 
Always valid for single precision, but invalid for double precision when x is a NaN, Inf, negative number, 
or -0. 

• x+0 -> x 
Invalid in single precision if x is a denormal operand or -0. Invalid in double precision if x=-0 under 
round-to-nearest, round to +infinity, and truncate. Also invalid in double precision if x is a SNaN or non-default QNaN 
and if x is a denormal number, in which case x+0 becomes a zero with appropriate sign. 

• x-0 -> x 
Valid for single precision, except if x is a denormal operand or -0. Invalid for double precision if x is a SNaN 
or non-default QNaN, if x is a denormal number, or if x is +0 and the rounding mode is rounding to -infinity. In this 
last case, x-0 = +0-0 = -0. For any normalized operand the result is valid even with round to -infinity. 

• -x -> 0-x 
Invalid for single precision when x is +0.0 or a denormal. Invalid for double precision in the following cases: 
1) For NaNs the value of -x is undefined; the result will be different for all NaNs. 2) If x is +0 and the rounding 
mode is rounding to nearest-even, +infinity, or truncation, 0-x = +0 and -x = -0. 

• x!=x -> false 
Always valid for single precision. For double precision, x=NaN always compares unordered, so x!=x -> 
true. 

• x==x -> true 
Always valid for single precision. For double precision, x=NaN always compares unordered, so x==x -> 
false. 

• x<y -> isless(x,y), 
x<=y -> islessequal(x,y), 
x>y -> isgreater(x,y), and 
x>=y -> isgreaterequal(x,y) 
Valid. Exceptions are due to flags that are set as side effects when x or y are NaN under double precision. 
The FENV_ACCESS pragma can change the invalid flag behavior. 
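
Several of the double-precision cases above can be observed on any host with IEEE double arithmetic and the default round-to-nearest mode. The sketch below is an illustration under that assumption; the function names are hypothetical, and the volatile qualifiers are there only to keep a compiler from applying the very transformations being tested. 

```c
#include <math.h>
#include <stdbool.h>

/* x/10 -> x*0.1: true only if both forms give the same double. */
static bool div_matches_mul(double x)
{
    volatile double q = x / 10.0;
    volatile double p = x * 0.1;
    return q == p;
}

/* -x -> 0-x: true only if both forms agree in value and in sign bit. */
static bool negate_matches_sub(double x)
{
    volatile double n = -x;
    volatile double s = 0.0 - x;
    return (n == s) && ((signbit(n) != 0) == (signbit(s) != 0));
}

/* x!=x -> false: returns the actual result of the comparison. */
static bool self_unequal(double x)
{
    volatile double v = x;
    return v != v;
}
```

With these helpers, div_matches_mul(3.0) is false because 0.1 is not exactly representable, negate_matches_sub(0.0) is false because -x is -0 while 0-x is +0, and self_unequal(NAN) is true because a NaN compares unordered with itself. 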














9.3.4. Specific Behavior of Standard Math Functions 


This section describes the specific behavior of various floating-point functions declared in math.h. As noted, the 
SPU hardware has a direct effect on the behavior of floating-point functions. Because of the many differences 
between strict IEEE behavior and the hardware behavior, the standard math functions do not need to provide rigorous 
checks for exception situations and out-of-range conditions. Consequently, the results of many functions are 
redefined. The following is a list of differences: 


• The function nanf() will return 0. 

• The isnan() macro will always return false for single precision. 

• Unlike C99 standard specifications, single-precision versions of nearbyint, lrint, llrint, and fma 
round towards zero. 

• Trig, hyperbolic, exponential, logarithmic, and gamma functions do not need to set the inexact flag when 
values are rounded. 

• The boundary cases for single-precision versions of frexp(NaN,exp) and modf(NaN,iptr) are not 
defined because these functions propagate and return NaN. 

• nextafterf(subnormal,y) will never raise an underflow flag. The functions nextafterf() and 
nexttowardf() will succeed when incrementing past the IEEE maximal float value. 

• The following boundary cases will not be supported for single precision because infinity is not a valid 
argument: atanf(+inf), atan2f(+y,±inf), atan2f(±inf,x), atan2f(±inf,±inf), 
acoshf(+inf), asinhf(±inf), atanhf(+1), atanhf(+inf), coshf(±inf), sinhf(+inf), 
tanhf(+inf), expf(+inf), exp2f(+inf), expm1f(+inf), frexpf(+inf,&exp), 
ldexpf(±inf,exp), logf(+inf), log10f(+inf), log1pf(+inf), log2f(+inf), logbf(±inf), 
modff(+inf,iptr), scalbnf(+inf,n), cbrtf(±inf), fabsf(inf), hypotf(±inf,y), 
powf(-1,+inf), powf(x,±inf), powf(±inf,y), sqrtf(±inf), erff(±inf), erfcf(+inf), 
lgammaf(±inf), tgammaf(+inf), ceilf(±inf), floorf(+inf), nearbyintf(+inf), 
roundf(±inf), rintf(±inf), lrintf(±inf), llrintf(+inf), lroundf(+inf), llroundf(±inf), 
truncf(±inf), fmodf(x,±inf), remainderf(+inf), remquof(±inf), and copysignf(±inf). 

• For single precision, the following boundary cases will produce a non-IEEE-compliant result: acosf(|x|>1), 
asinf(|x|>1), acoshf(x<1.0), atanhf(|x|>1), tgammaf(x<0), fmodf(x,0), 
ldexpf(x,BIG_INT), logf(+0), logf(x<0), log10f(+0), log10f(x<0), log1pf(-1), 
log1pf(x<-1), log2f(+0), log2f(x<0), logbf(+0), powf(+0,y), and tgammaf(+0). 

• For single precision, the following boundary cases will not return NaN: cosf(+inf), sinf(+inf), 
tanf(+inf), tgammaf(-inf), fmodf(+inf,y), nextafterf(x,±inf), fmaf(±inf|0,0|+inf,z), 
and fmaf(±inf,0,-+inf). 

• Section “9.3.1. Floating-Point Conversions” describes the behavior of implicit conversions when a 
single-precision value is passed as an argument to a double-precision function or when a single-precision variable is 
assigned the result of a double-precision function. 